Small Datalog Query Rewritings for EL? Giorgio Stefanoni, Boris Motik, and Ian Horrocks Department of Computer Science, University of Oxford 1 Introduction Description Logics are a key technology in data management scenarios such as Ontology-Based Data Access (OBDA), a paradigm in which a DL ontology is used to provide a conceptual view of the data [1]. An OBDA system transforms a conjunctive query over the ontology into a query over the data sources [2]. This transformation is independent of the data, so the OBDA approach can thus be used in settings where the data sources provide read-only access to the data, and where the data changes frequently. Most existing OBDA systems are based on the DL-Lite family of lightweight Description Logics [1], which is also the basis for the QL profile of the OWL 2 ontology language. Logics in this family have been designed to allow a conjunc- tive query posed over the ontology to be rewritten as a first order query over the data sources—that is, queries are first-order rewritable. The query rewriting procedure is independent of the data, and the resulting queries can be evaluated using highly scalable relational database technology. To achieve this, however, the expressive power of DL-Lite is very restrictive. This prevents the OBDA approach from being applied in the life science domain, where many ontologies use DLs from the EL family [3, 4]. This family provides the basis for the EL profile of OWL2, and many prominent ontologies, such as SNOMED-CT, were developed using this language. The problem of answering conjunctive queries in EL has already been studied in the literature, and two orthogonal approaches have been proposed. First, Rosati proposed a pure query rewriting technique which transforms an EL TBox T and a conjunctive query q into a Datalog program PT ,q [5]. Second, Lutz et al. introduced a “combined” approach [6, 7]. This technique first materializes certain facts entailed by the ontology in a precomputation step. Then, each user query is rewritten into a polynomial first-order query that, when evaluated over the materialized facts, computes the answers to the user’s query. Unfortunately, these two approaches exhibit several shortcomings when ap- plied in the context of OBDA. In particular, Rosati’s rewriting technique com- putes for each user query a fresh Datalog program whose size depends on both the query and the terminology, which could be very inefficient when dealing with large scale ontologies. The approach by Lutz et al. produces smaller first order rewritings, but the use of materialization means that the technique is only appli- cable when the data sources provide read/write access to the data; furthermore, materialization can be inefficient if the data changes frequently. ? This work was supported by EPSRC and Alcatel-Lucent. In this paper, we present a pure query rewriting technique to query answering in EL. Our approach reinterprets the combined approach proposed by Lutz and colleagues in terms of Datalog. Our rewriting procedure consists of two distinct steps. The first step rewrites a TBox T into a Datalog program PT , whose size depends linearly on the size of T . Then, at query time, the conjunctive query q is rewritten into a Datalog query hQP , QC i, whose size depends polynomially on q. The two rewriting steps are such that, given an ABox A, deciding whether QP (a1 , . . . , ak ) follows from PT ∪ QC ∪ A is equivalent to deciding whether ha1 , . . . , ak i is a certain answer to q over a knowledge base hT , Ai. At last, we summarize our main contributions. First, our rewriting approach, unlike Rosati’s, separates the rewriting of the TBox and the query into two dis- tinct steps, thus reducing inefficiency when dealing with large ontologies. Second, our technique does not require the materialization of entailed facts, hence our solution is in the spirit of OBDA and it avoids the problems associated with the materialization of large models. Finally, we set the stage for assessing the utility and the applicability to PT ∪ QC ∪ A of optimized Datalog evaluation techniques, such as magic sets and SLG resolution [8, 9]. Indeed, heuristic-based evaluation strategies significantly reduce the number of facts needed to answer a query, thus potentially improving the performance of our rewriting approach. In this paper we provide only proof sketches, and refer the reader to [10] for full proofs. 2 Preliminaries Description Logic EL Let NC , NR , NI be pairwise disjoint infinite sets of atomic concepts, atomic roles, and individuals. Together, the sets NC , NR , and NI form the signature of an EL language. Whenever the distinction between atomic concepts and atomic roles is immaterial, we call an element of NC ∪ NR a predicate. The set of EL concept expressions is inductively defined starting from atomic concepts A ∈ NC and atomic roles R ∈ NR according to the syntax rule: C → A | C1 u C2 | ∃R.C | >. An EL TBox T is a finite set of concept inclusions of the form C v D; an EL ABox A is a finite set of assertions of the form A(a) or R(a, b) with a and b individuals; and an EL knowledge base (KB) is a tuple K = hT , Ai, where T is an EL TBox and A is an EL ABox. We denote with Ind(A) the set of all individuals occurring in the ABox A. Furthermore, for E either a TBox or an ABox, Pred(E) is the set of all predicates occurring in E. Semantics is given as usual in terms of first-order interpretations I = h∆I , ·I i, where ∆I is a nonempty domain and ·I is an interpretation function; please re- fer to [3] for details. In the following, we will extensively use the notion of an unraveling of an interpretation w.r.t. an ABox. Consider an interpretation I and an ABox A over an arbitrary EL signature. A path p in I w.r.t. A is a nonempty finite sequence c1 · R2 · c2 · · · Rn−1 · cn−1 · Rn · cn such that c1 ∈ {aI | a ∈ Ind(A)} I and for all 1 ≤ i ≤ n − 1 we have that hci , ci+1 i ∈ Ri+1 for Ri+1 ∈ NR . We say that a path p has depth n and we write dep(p) = n; furthermore, tail(p) is the last domain element cn in p. Let pathsA (I) denote the set of all paths w.r.t. A occurring in I. The unraveling J of I w.r.t. A is the following interpretation. ∆J = pathsA (I) aJ = aI AJ = {p ∈ pathsA (I) | tail(p) ∈ AI } RJ = {haJ , bJ i | R(a, b) ∈ A} ∪ {hp, p · R · ci | {p, p · R · c} ⊆ pathsA (I)} In this paper, we deal only with normalized EL TBoxes. Let A1 , A, and B be arbitrary concepts from NC ∪ {>}. We say that an EL TBox T is in normal form if each axiom in T is in one of the following forms: A v B, A u A1 v B, A v ∃R.B, or ∃R.A v B. Given an arbitrary EL TBox T , we can compute a normalized TBox Tnorm of T in linear time [3]. Querying EL KBs Let NV be an infinite set of variables disjoint from NI . Together, NV and NI form the set NT of terms. A first-order query q is a first-order formula constructed from the terms in NT and the predicates from NC ∪ NR [8]. In general, we write q = ψ(~x) to express that q is the FO formula ψ whose answer variables are ~x = {x1 , . . . , xk }. A query with k answer variables is a k-ary query. A conjunctive query (CQ) is a FO query of the form q = ∃~y .ψ(~x, ~y ), where ψ is a conjunction of unary atoms A(s) and binary atoms R(s, t) with s and t terms. The variables ~y are the quantified variables of q. In the following, avar(q) is the set of answer variables of q, and qvar(q) is the set of quantified variables. Finally, NV (q) is the set of all variables occurring in q, and NT (q) is the set of all terms occurring in q. Let q = ψ(~x) be a k-ary FO query with ~x = hx1 , . . . , xk i and let I be an interpretation. We say that a k-ary tuple of individuals ha1 , . . . , ak i is an answer to q in I, written I |= q[a1 , . . . , ak ], if I satisfies q under the mapping π which sets π(xi ) = ai for all 1 ≤ i ≤ k. We call π a match for q in I witnessing ha1 , . . . , ak i, written I |=π q. We say that ha1 , . . . , ak i is a certain answer to q over K if I |= q[a1 , . . . , ak ], for all models I of K. We denote the set of all certain answers to q over K with cert(q, K). Rosati in [5] showed that deciding whether a tuple of individuals is a certain answer to q over K is Ptime-complete w.r.t. data-complexity (i.e., w.r.t. the size of the ABox); Ptime-complete w.r.t. KB complexity (i.e., w.r.t. the size of K); and, NP-complete w.r.t. combined complexity (i.e., w.r.t. the size of both K and q). Datalog Let NB be a nonempty set of built-in predicates [11]. Then, a Datalog rule r is an expression of the form S(~u) ← S1 (~u1 ), . . . , Sn (~un ), Bn+1 (~un+1 ) . . . , Bm (~um ), where n, m ≥ 0, {S, S1 , . . . , Sn } ⊆ NC ∪ NR , {Bn+1 , . . . , Bm } ⊆ NB , and ~u and ~ui are tuples of terms of suitable length. A rule is safe if each variable occurring in ~u ∪ ~un+1 ∪ . . . ∪ ~um also occurs in ~u1 ∪ . . . ∪ ~un . Atom S(~u) is the head of the rule, and atoms S1 (~u1 ), . . . , Bm (~um ) constitute the body of the rule. Whenever the body of a rule r is empty, we call r a fact, and we write the rule as S(~u). A Datalog program P is a set of safe Datalog rules. Finally, sch(P ) is the set of predicates occurring in P . Next, we define the semantics of a Datalog program P using Herbrand interpretations [8]. The Herbrand Universe of P is the set of all individuals occurring in P . The Herbrand Base of P is the set of all facts that can be constructed from the predicates in NC ∪ NR and the individuals in the universe of P . A Herbrand interpretation I of P is a subset of the Herbrand Base of P . Note that I does not interpret built-in predicates. As usual, we assume that these predicates are evaluated over a fixed, possibly infinite Herbrand interpretation B [12]. Then I is a model of P w.r.t. B if, for all the rules r in P , we have that I ∪ B |= ∀~x(Bm (~un ) ∧ . . . ∧ Bn+1 (~un+1 ) ∧ Sn (~un ) ∧ . . . ∧ S1 (~u1 ) → S(~u)), where ~x is a tuple consisting of all variables occurring in the rule. The semantics of a Datalog program P is defined as the minimal Herbrand interpretation I satisfying P w.r.t. B, written MB (P ). Whenever the program does not contain built-in predicates, we do not consider the interpretation B and we simply write M(P ). The semantics of Datalog programs can be defined also by means of a fixpoint construction. Then, TP is the immediate consequence operator that maps instances I over sch(P ) to instances over sch(P ) as follows. For each rule r in P , if there exists a match π for S1 (~u1 ) ∧ . . . ∧ Sn (~un ) ∧ Bn+1 (~un+1 ) ∧ . . . ∧ Bm (~um ) in I ∪ B, then S(a1 , . . . , ak ) is contained in TP (I) with ai = π(ui ) for each ui ∈ ~u. One can prove that TP has a minimum fixpoint TPω such that TPω = MB (P ) [8]. Finally, a Datalog query is a tuple hQP , QC i where QP is a predicate symbol and QC is a Datalog program. A tuple of individuals ha1 , . . . , ak i is an answer to hQP , QC i over Datalog program P if P ∪ QC |= QP (a1 , . . . , ak ). 3 Datalog Rewriting for EL TBoxes In this section, we show how to transform an EL TBox T into a Datalog program PT whose size depends linearly on T . The transformation is such that, for an arbitrary EL ABox A, we can use the unraveling of M(PT ∪A) to compute the answers to conjunctive queries over hT , Ai. Let T be a TBox over an arbitrary EL signature. Intuitively, for each axiom α occurring in T , the program PT contains a set of Datalog rules which encode the constraint imposed by α. To achieve this, we have to overcome two issues. First, EL concept inclusions of the form A v ∃R.B require the use of either existential quantifications or Skolem terms in rule heads; however, Datalog does not allow neither of the two. In order to solve this issue, we use a technique that has been introduced for representing canonical models of EL knowledge bases [3]. That is, for each atomic concept B occurring in T we introduce a fresh Axiom α Set of rules Θ(α) AvB B(X) ← A(X) A1 u A2 v B B(X) ← A1 (X), A2 (X) ∃R.A v B B(X) ← R(X, Y ), A(Y ) R(X, oB ) ← A(X) A v ∃R.B B(oB ) ← A(X) Fig. 1. Transformation of EL Axioms into Rules. auxiliary individual oB , which represents the class of existentially quantified individuals of type B. Then, for each axiom of the above form, the program PT contains the following two rules: R(X, oB ) ← A(X); B(oB ) ← A(X). Second, EL allows for > to occur in concept expressions. Hence, we need to define in PT a unary predicate >, whose extension—given an ABox A—coincides with the Herbrand universe of PT ∪ A. To achieve this, we restrict our study to a subset of all EL ABoxes. In particular, we consider only those ABoxes A such that Pred(A) ⊆ Pred(T ). That is, each predicate occurring in the ABox A must occur also in the TBox T . Then, in our Datalog program, for each atomic concept A and for each atomic role R occurring in T , we add the following rules: >(X) ← A(X); >(X) ← R(X, Y ); >(Y ) ← R(X, Y ). This is only one of the several ways in which we can encode such a predicate. In fact, another possibility would be—as suggested by Rosati in [5]—to assume that each ABox A contains an assertion >(a) for each individual a ∈ Ind(A). We believe that in the context of OBDA—where the focus is to provide access to arbitrary data-sources—it is important to make as few assumptions as possible on the physical realization of the ABox. For this reason, we prefer the option presented above. Next, we formalize the transformation of a TBox T into a Datalog program PT . Let Aux = {oA | A ∈ NC } ∪ {o> } be a set of auxiliary individuals distinct from NI . Then, the program PT is constructed from terms in NT ∪ Aux and predicates in NC ∪ NR ∪ {>} as follows. The transformation uses the function Θ, shown in Figure 1, to transform each axiom in the (normalized) TBox T into a set of Datalog rules. The Datalog program PT is then defined as follows. S PT = Sα∈T Θ(α) A∈Pred(T )∩NC >(X) ← A(X) S R∈Pred(T )∩NR >(X) ← R(X, Y ), >(Y ) ← R(X, Y ) The following result readily follows from the definition of the program. Proposition 1. For an arbitrary EL TBox T , Datalog program PT can be computed in time linear in the size of T . Consider an arbitrary EL ABox A. Next, we prove that the unraveling U w.r.t. A of M(PT ∪A) can be used to answer conjunctive queries over K = hT , Ai. We do so in two distinct steps. First, we introduce the notion of chase of an EL knowledge base K. Second, we show that the chase of K is isomorphic to U. The chase of an EL knowledge base K = hT , Ai, written chase(K), is a possibly infinite Herbrand interpretation defined inductively by starting from A and then applying axioms occurring in the TBox to assertions occurring in the ABox. In our definition of the chase, we use function terms to denote existentially quantified individuals. Hence, the definition of ABox assertion is extended in a natural way to accommodate for assertions over function terms. We denote with u and w terms that can be either individuals or function terms. Next, we define an operator ΓT that chases an ABox by applying the axioms occurring in the TBox T . In the definition, we use assertions of the form >(u) to assert that u is a member of the EL concept expression >. For S an arbitrary ABox, ΓT (S) is the smallest ABox containing S and closed under the following chasing rules. (cr1) If {A(u)} ⊆ S and A v B ∈ T , then {B(u)} ⊆ ΓT (S). (cr2) If {A1 (u), A2 (u)} ⊆ S and A1 u A2 v B ∈ T , then {B(u)} ⊆ ΓT (S). (cr3) If {R(u, w), A(w)} ⊆ S and ∃R.A v B ∈ T , then {B(u)} ⊆ ΓT (S). (cr4) If {A(u)} ⊆ S and A v ∃R.B ∈ T , then {R(u, f (u, R, B)), B(f (u, R, B))} ⊆ ΓT (S). (cr5) If u occurs in S, then {>(u)} ⊆ ΓT (S). We now define an infinite sequence of finite ABoxes Ai for i ∈ N. A0 = A Ai+1 = ΓT (Ai ) Finally, the chase of K is the infinite union of all such ABoxes Ai . [ chase(K) = Ai i∈N It is clear that our construction of the chase of K is fair. In fact, for each i ∈ N we have that Ai+1 is the result of exhaustively applying—to all possible assertions occurring in Ai —all applicable axioms in T . At last, we want to point out that chase(K) can be used to compute the certain answers to a CQ q over K. Proposition 2 ([5]). Let K be an EL knowledge base. Further, let q be a k-ary conjunctive query. Then, for each k-ary tuple of individuals ha1 , . . . , ak i, we have ha1 , . . . , ak i ∈ cert(q, K) if and only if chase(K) |= q[a1 , . . . , ak ]. So, by proving that the unraveling U of M(PT ∪A) is isomorphic to chase(K), we establish that U can be used to answer conjunctive queries over K. To prove the structural equivalence of U and chase(K), we define a function h mapping paths occurring in U to terms in chase(K). We define h by induction on the depth of paths occurring in U as follows. Base Case. Consider an arbitrary p ∈ ∆U with dep(p) = 1. We set h(p) := p. Inductive Step. Let p = t1 · R2 · t2 · · · tn−1 · Rn · tn be a path occurring in U such that h(p) has not been defined yet, but h(t1 · · · Rn−1 · tn−1 ) = u. We distinguish between two cases depending on the type of the individual tn . 1. If tn occurs in the ABox, we set h(p) := tn . 2. If tn is of the form oB , we set h(p) := f (u, Rn , B). Theorem 1 shows that h is an isomorphism between the two structures. In- tuitively, for the only-if direction, we show that h is an injective homomorphism from U to chase(K) by induction on the depth of paths occurring in U; for the if-direction, by induction on the construction of chase(K) we prove that h is a surjective function and that it is a homomorphism from chase(K) to U. Theorem 1. Function h is an isomorphism from U to chase(K). Since the unraveling of M(PT ∪ A) is generally infinite, this result alone does not provide us with an algorithm for answering queries in EL. In the next section, we show how to rewrite a user query q into a Datalog query hQP , QC i such that PT ∪ A ∪ QC |= QP (a1 , . . . , ak ) if and only if ha1 , . . . , ak i ∈ cert(q, K) and thus solve the problem. 4 Polynomial Query Rewriting in Datalog In the previous section, we have seen that for an arbitrary EL KB K = hT , Ai evaluating a conjunctive query q over the unraveling of the Herbrand model of PT ∪ A is equivalent to computing the certain answers to q over K. In this section, we develop a query rewriting procedure that reduces the computation of cert(q, K) to the problem of evaluating a suitably constructed Datalog query over PT ∪ A. We achieve this in two steps. First, we present an interesting property of a certain class of interpretations. Second, we show how this result can be used to develop a query rewriting procedure in our Datalog setting. We use the notions of A-connected and split interpretations from [6, 13]. Let I be an interpretation and let A be an ABox over an arbitrary EL signature. We say that I is A-connected if, for each domain element c ∈ ∆I , there exists a path p ∈ pathsA (I) such that tail(p) = c. Furthermore, I is a split interpretation if, for all domain elements c, c0 ∈ ∆I , we have that c 6∈ {aI | a ∈ Ind(A)} and hc, c0 i ∈ RI imply c0 6∈ {aI | a ∈ Ind(A)}. Intuitively, in an A-connected interpretation I, for each domain element cn it is always possible to find a path aI · R2 · c2 · · · Rn · cn such that a ∈ Ind(A). Furthermore, if I is split, then each domain element that is not the image of an individual can be related by a role only with elements that themselves do not interpret individuals. Then, let J be the unraveling w.r.t. A of a split and A-connected interpre- tation I and let q be a conjunctive query. Lutz et al. in [6, 13] showed that it is possible to reduce the problem of answering q in J to evaluating a first-order query rewriting q ∗ of q over I. Roughly speaking, the query rewriting q ∗ rules out some spurious answers for q in I that cannot be reproduced in J . More specifically, we have to ensure that the answer variables of q, the variables of q mapped to cyclic portions of I, and the variables of q mapped to nontree portions of I are all matched only to the domain elements in {aI | a ∈ Ind(A)}. We now briefly outline how we can construct such an FO rewriting q ∗ for q [6]. Let ∼q be the smallest equivalence relation over NT (q) that is closed under the following rule: if R(s, t) and R(s0 , t0 ) occur in q and t ∼q t0 , then s ∼q s0 . Then, for each equivalence class ζ of ∼q , we let tζ ∈ ζ be an arbitrary, but fixed, representative of the class. Also, for each such equivalence class ζ and for each atomic role R occurring in q, we let Pred(ζ, R) be the following set. Pred(ζ, R) = {t ∈ NT (q) | R(t, t0 ) occurs in q with t0 ∈ ζ} Next, we define three sets of terms that correspond to the above mentioned cases. – Fork= is the set of all pairs hPred(ζ, R), tζ i such that ζ is an equivalence class of ∼q and |Pred(ζ, R)| > 1. – Fork6= is the set of all quantified variables v ∈ qvar(q) for which atoms R(s, v) and S(s0 , t) exist in q such that R 6= S and v ∼q t. – Cyc is the set of all variables v ∈ qvar(q) for which atoms R0 (t0 , t00 ), . . . , Rm (tm , t0m ), . . . , Rn (tn , t0n ) exist in q such that m, n ≥ 0; for some i ≤ n we have that ti ∼q v; for each j < n we have that t0j ∼q tj+1 ; and t0n ∼q tm . We are now ready to formally specify the FO query rewriting q ∗ . In the definition, we assume that Aux is a fresh predicate not occurring in q and K and that every interpretation I interprets Aux as ∆I \ {aI | a ∈ Ind(A)}. Then, formulae q1 and q2 are defined as follows. ^ q1 = ¬Aux(v) v∈avar(q)∪Fork6= ∪Cyc ^ ^ q2 = ¬Aux(tζ ) ∨ (t = t0 ) hPred(ζ,R),tζ i∈Fork= t,t0 ∈Pred(ζ,R) Finally, we set q ∗ = q∧q1 ∧q2 . It turns out that q ∗ can be computed in polynomial time w.r.t. q [6]. In the same paper, Lutz et al. prove the following result. Proposition 3. Let A be an arbitrary EL ABox, let I be a split and A-connected interpretation, and let J be the unraveling of I w.r.t. A. Then, for every k-tuple of individuals ha1 , . . . , ak i, we have that I |= q ∗ [a1 . . . , ak ] if and only if J |= q[a1 . . . , ak ]. This result applies to our Datalog rewriting of EL TBoxes. Indeed, for an arbitrary EL KB K = hT , Ai, we have that M(PT ∪A) is a split and A-connected interpretation. The intuition behind the argument is as follows. We show that M(PT ∪ A) is split by noticing that rules encoded in PT do not allow for the derivation of facts of the form R(oB , a) for a ∈ Ind(A) and oB ∈ Aux. To see that M(PT ∪ A) is A-connected, we just recall that M(PT ∪ A) is minimal and, hence, all the derived facts must be “grounded” w.r.t. the facts in A. Theorem 2. Let K = hT , Ai be an EL knowledge base. Then, M(PT ∪ A) is a split and A-connected interpretation. Hence, for an arbitrary k-ary CQ q and for each k-tuple of individuals ha1 , . . . , ak i, we have that M(PT ∪ A) |= q ∗ [a1 . . . , ak ] if and only if ha1 , . . . , ak i ∈ cert(q, K). Note that q ∗ is a first-order query, and we are unaware of systems capa- ble of evaluating first-order queries over Datalog programs. Therefore, we next show how to transform q ∗ into a Datalog query hQP , QC i such that ha1 , . . . , ak i ∈ cert(q, K) if and only if PT ∪ A ∪ QC |= QP (a1 . . . , ak ). We con- struct such a Datalog query hQP , QC i by applying to q ∗ a simplified version of the Lloyd-Topor transformation [14, 15]. Definition 1 (Datalog Rewriting). Let q(~x) be a k-ary CQ whose quanti- fied variables are among ~y ; let Cyc, Fork6= , and Fork= be as specified above; let hPred(ζ 1 , R1 ), t1ζ i, . . ., hPred(ζ n , Rn ), tnζ i be an arbitrary enumeration of Fork= ; let p0 , p1 , . . . , pn be fresh predicates; and let Named be a built-in with a prede- termined, possibly infinite Herbrand interpretation N = {Named(a) | a ∈ NI }. Query QC then contains the following safe Datalog rules: ^ p0 (~x, ~y ) ← q, Named(v) (1) v∈avar(q)∪Fork6= ∪Cyc pi (~x, ~y ) ← pi−1 (~x, ~y ), Named(tiζ ) for 1 ≤ i ≤ n (2) ^ pi (~x, ~y ) ← pi−1 (~x, ~y ), t = t0 for 1 ≤ i ≤ n (3) t,t0 ∈Pred(ζ i ,Ri ) QP (~x) ← pn (~x, ~y ) (4) One may think that the recursive definition of predicates pi for 1 ≤ i ≤ n could be simplified by writing QP (~x) ← p0 (~x, ~y ) . . . pn (~x, ~y ) and by defining each pi as: pi (~x, ~y ) ← Named(tiζ ) pi (~x, ~y ) ← t,t0 ∈Pred(ζ i ,Ri ) t = t0 V Unfortunately, these rules are not safe. Safe rules, on the one hand, provide us with a clear and unambiguous semantics. On the other hand, unsafe rules are also computationally more expensive for bottom-up computation, since each variable in the head may be bound to an arbitrary individual in the universe of the program. For this reason, we prefer our, slightly more involved, solution. Proposition 4. For an arbitrary k-ary conjunctive query q, query hQP , QC i can be computed in polynomial time w.r.t. the size of q. Proof. We note that ∼q can be computed in polynomial time w.r.t. the size of q [6] and, therefore, also the sets Cyc, Fork6= , and Fork= can be computed in poly- nomial time w.r.t. q. Furthermore, the size of the body of rule p0 (~x, ~y ) depends linearly on the size of q, Cyc, and Fork6= . Also, for each pair hPred(ζ, R), tζ i in Fork= , the program QC contains exactly two rules. The size of these two rules depends linearly on the size of hPred(ζ, R), tζ i. Thus, we conclude that hQP , QC i can be computed in polynomial time with respect to the size of q. t u In [10], we prove that our rewriting is correct—that is, that answering hQP , QC i over PT ∪ A is equivalent to computing cert(q, T , A). This follows from Propo- sition 3 and the fact that our Datalog query is the result of transforming the FO rewriting q ∗ along the lines of the Lloyd-Topor transformation. Theorem 3. Let K be an EL knowledge base and let q be a k-ary CQ over K. Then, for every k-tuple of individuals ha1 , . . . , ak i, we have that ha1 , . . . , ak i ∈ cert(q, K) if and only if PT ∪ A ∪ QC |= QP (a1 , . . . , ak ). Finally, we investigate the complexity of our rewriting procedure. Theorem 4. Let K = hT , Ai be an EL KB, let q be a k-ary CQ, and let ha1 , . . . , ak i be a tuple of individuals. We can decide PT ∪A∪QC |= QP (a1 , . . . , ak ) in polynomial time w.r.t. the size of K and in non-deterministic polynomial time with respect to the size of both K and q. Proof. We have already argued that the size of Datalog program PT depends linearly on the size of the TBox T and that the Datalog rewriting hQP , QC i can be computed in Ptime w.r.t. q. Also, we note that the arity of predicates and the number of variables occurring in PT ∪ A ∪ QC do not depend on K. Finally, from an implementation point-of-view (as suggested in [12]), the built-in predicate Named can be considered as an assertion in the ABox A with a different physical realization: it is not directly stored in the ABox but it is implemented as a procedure which is evaluated during the execution of the program. Clearly, such a procedure can be implemented to run in time polynomial in K. It follows that we can compute the minimal Herbrand model of PT ∪ A ∪ QC in time polynomial in the size of K [8]. The membership in NP follows directly from the considerations above and from the fact that we can guess and check in nondeterministic polynomial time a match π for QP in M(PT ∪ A ∪ QC ). t u 5 Conclusions In this paper, we introduce a query rewriting approach to query answering in EL. In our approach, the process of computing the certain answers to a CQ q over an EL KB K = hT , Ai is divided into two distinct steps. A first preprocessing step in which the TBox T is transformed into a Datalog program PT , whose size is linear in T . Then, at query time, the query q is independently rewritten into a Datalog query hQP , QC i, whose size is polynomial in q. Finally, computing cert(q, K) amounts to evaluating the Datalog query hQP , QC i over PT ∪ A. dr In future, we plan to extend our approach to deal with ELH⊥ . Lutz et al. have already proposed a combined approach for query answering in this DL [13]. However, differently from their solution, we would like the Datalog rewriting hQP , QC i to be independent from the role inclusions contained in the TBox. Additionally, we plan to extend our work to cover nominals, which raises the question on how to efficiently handle equality in Datalog [2]. References 1. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., Rodrı́guez- Muro, M., Rosati, R.: Ontologies and databases: the DL-Lite approach. In Tessaris, S., Franconi, E., eds.: Semantic Technologies for Informations Systems – 5th Int. Reasoning Web Summer School (RW 2009). Volume 5689. (2009) 255–356 2. Pérez-Urbina, H., Motik, B., Horrocks, I.: Tractable query answering and rewriting under description logic constraints. J. Applied Logic 8(2) (2010) 186–209 3. Baader, F., Brandt, S., Lutz, C.: Pushing the EL envelope. In Kaelbling, L.P., Saffiotti, A., eds.: IJCAI, Professional Book Center (2005) 364–369 4. Baader, F., Brandt, S., Lutz, C.: Pushing the EL envelope further. In Clark, K., Patel-Schneider, P.F., eds.: In Proceedings of the OWLED 2008 DC Workshop on OWL: Experiences and Directions. (2008) 5. Rosati, R.: On conjunctive query answering in EL. In Calvanese, D., Franconi, E., Haarslev, V., Lembo, D., Motik, B., Turhan, A.Y., Tessaris, S., eds.: Description Logics. Volume 250 of CEUR Workshop Proceedings., CEUR-WS.org (2007) 6. Lutz, C., Toman, D., Wolter, F.: Conjunctive query answering in EL using a database system. In: Proceedings of the OWLED 2008 Workshop on OWL: Expe- riences and Directions. (2008) 7. Kontchakov, R., Lutz, C., Toman, D., Wolter, F., Zakharyaschev, M.: The com- bined approach to ontology-based data access. In: IJCAI, AAAI Press (2011) 8. Abiteboul, S., Hull, R., Vianu, V.: Foundations of databases. Addison-Wesley (1995) 9. Chen, W., Warren, D.S.: Query evaluation under the well-founded semantics. In: Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. PODS ’93, New York, NY, USA, ACM (1993) 168–179 10. Stefanoni, G., Motik, B., Horrocks, I.: Small datalog query rewrit- ing for EL. Technical report, University of Oxford (2012) Available at http://www.cs.ox.ac.uk/people/giorgio.stefanoni/pubs/2012/tr/dl2012.pdf. 11. Ullman, J.D.: Principles of database and knowledge-base systems, Vol. I. Computer Science Press, Inc., New York, NY, USA (1988) 12. Ceri, S., Gottlob, G., Tanca, L.: What you always wanted to know about datalog (and never dared to ask). IEEE Trans. Knowl. Data Eng. 1(1) (1989) 146–166 13. Lutz, C., Toman, D., Wolter, F.: Conjunctive query answering in the description logic EL using a relational database system. In Boutilier, C., ed.: IJCAI. (2009) 2070–2075 14. Lloyd, J., Topor, R.: Making prolog more expressive. The Journal of Logic Pro- gramming 1(3) (1984) 225 – 240 15. Decker, S.: Semantic web methods for knowledge managmement. PhD thesis, University of Karlsruhe (2002)