Small Datalog Query Rewritings for EL

Introduction

Description Logics are a key technology in data management scenarios such as Ontology-Based Data Access (OBDA), a paradigm in which a DL ontology is used to provide a conceptual view of the data [1]. An OBDA system transforms a conjunctive query over the ontology into a query over the data sources [2]. This transformation is independent of the data, so the OBDA approach can thus be used in settings where the data sources provide read-only access to the data, and where the data changes frequently.

Most existing OBDA systems are based on the DL-Lite family of lightweight Description Logics [1], which is also the basis for the QL profile of the OWL 2 ontology language. Logics in this family have been designed to allow a conjunctive query posed over the ontology to be rewritten as a first order query over the data sources-that is, queries are first-order rewritable. The query rewriting procedure is independent of the data, and the resulting queries can be evaluated using highly scalable relational database technology. To achieve this, however, the expressive power of DL-Lite is very restrictive. This prevents the OBDA approach from being applied in the life science domain, where many ontologies use DLs from the EL family [3,4]. This family provides the basis for the EL profile of OWL2, and many prominent ontologies, such as SNOMED-CT, were developed using this language.

The problem of answering conjunctive queries in EL has already been studied in the literature, and two orthogonal approaches have been proposed. First, Rosati proposed a pure query rewriting technique which transforms an EL TBox T and a conjunctive query q into a Datalog program P T ,q [5]. Second, Lutz et al. introduced a "combined" approach [6,7]. This technique first materializes certain facts entailed by the ontology in a precomputation step. Then, each user query is rewritten into a polynomial first-order query that, when evaluated over the materialized facts, computes the answers to the user's query.

Unfortunately, these two approaches exhibit several shortcomings when applied in the context of OBDA. In particular, Rosati's rewriting technique computes for each user query a fresh Datalog program whose size depends on both the query and the terminology, which could be very inefficient when dealing with large scale ontologies. The approach by Lutz et al. produces smaller first order rewritings, but the use of materialization means that the technique is only applicable when the data sources provide read/write access to the data; furthermore, materialization can be inefficient if the data changes frequently.

This work was supported by EPSRC and Alcatel-Lucent.

In this paper, we present a pure query rewriting technique to query answering in EL. Our approach reinterprets the combined approach proposed by Lutz and colleagues in terms of Datalog. Our rewriting procedure consists of two distinct steps. The first step rewrites a TBox T into a Datalog program P T , whose size depends linearly on the size of T . Then, at query time, the conjunctive query q is rewritten into a Datalog query Q P , Q C , whose size depends polynomially on q. The two rewriting steps are such that, given an ABox A, deciding whether Q P (a 1 , . . . , a k ) follows from P T ∪ Q C ∪ A is equivalent to deciding whether a 1 , . . . , a k is a certain answer to q over a knowledge base T , A .

At last, we summarize our main contributions. First, our rewriting approach, unlike Rosati's, separates the rewriting of the TBox and the query into two distinct steps, thus reducing inefficiency when dealing with large ontologies. Second, our technique does not require the materialization of entailed facts, hence our solution is in the spirit of OBDA and it avoids the problems associated with the materialization of large models. Finally, we set the stage for assessing the utility and the applicability to P T ∪ Q C ∪ A of optimized Datalog evaluation techniques, such as magic sets and SLG resolution [8,9]. Indeed, heuristic-based evaluation strategies significantly reduce the number of facts needed to answer a query, thus potentially improving the performance of our rewriting approach. In this paper we provide only proof sketches, and refer the reader to [10] for full proofs.

Preliminaries

Description Logic EL

Let N C , N R , N I be pairwise disjoint infinite sets of atomic concepts, atomic roles, and individuals. Together, the sets N C , N R , and N I form the signature of an EL language. Whenever the distinction between atomic concepts and atomic roles is immaterial, we call an element of N C ∪ N R a predicate. The set of EL concept expressions is inductively defined starting from atomic concepts A ∈ N C and atomic roles R ∈ N R according to the syntax rule:

C → A | C 1 C 2 | ∃R.C | .

An EL TBox T is a finite set of concept inclusions of the form C D; an EL ABox A is a finite set of assertions of the form A(a) or R(a, b) with a and b individuals; and an EL knowledge base (KB) is a tuple K = T , A , where T is an EL TBox and A is an EL ABox. We denote with Ind(A) the set of all individuals occurring in the ABox A. Furthermore, for E either a TBox or an ABox, Pred(E) is the set of all predicates occurring in E.

Semantics is given as usual in terms of first-order interpretations I = ∆ I , • I , where ∆ I is a nonempty domain and • I is an interpretation function; please refer to [3] for details. In the following, we will extensively use the notion of an unraveling of an interpretation w.r.t. an ABox. Consider an interpretation I and an ABox A over an arbitrary EL signature. A path p in I w.r.t. A is a nonempty finite sequence

c 1 • R 2 • c 2 • • • R n−1 • c n−1 • R n • c n such that c 1 ∈ {a I | a ∈ Ind(A)} and for all 1 ≤ i ≤ n − 1 we have that c i , c i+1 ∈ R I i+1 for R i+1 ∈ N R .∆ J = paths A (I) a J = a I A J = {p ∈ paths A (I) | tail(p) ∈ A I } R J = { a J , b J | R(a, b) ∈ A} ∪ { p, p • R • c | {p, p • R • c} ⊆ paths A (I)}

In this paper, we deal only with normalized EL TBoxes. Let A 1 , A, and B be arbitrary concepts from N C ∪ { }. We say that an EL TBox T is in normal form if each axiom in T is in one of the following forms:

A B, A A 1 B, A ∃R.B, or ∃R.A B.

Given an arbitrary EL TBox T , we can compute a normalized TBox T norm of T in linear time [3].

Querying EL KBs

Let N V be an infinite set of variables disjoint from N I . Together, N V and N I form the set N T of terms. A first-order query q is a first-order formula constructed from the terms in N T and the predicates from N C ∪ N R [8]. In general, we write q = ψ( x) to express that q is the FO formula ψ whose answer variables are x = {x 1 , . . . , x k }. A query with k answer variables is a k-ary query. A conjunctive query (CQ) is a FO query of the form q = ∃ y.ψ( x, y), where ψ is a conjunction of unary atoms A(s) and binary atoms R(s, t) with s and t terms. The variables y are the quantified variables of q. In the following, avar(q) is the set of answer variables of q, and qvar(q) is the set of quantified variables. Finally, N V (q) is the set of all variables occurring in q, and N T (q) is the set of all terms occurring in q. Let q = ψ( x) be a k-ary FO query with x = x 1 , . . . , x k and let I be an interpretation. We say that a k-ary tuple of individuals a 1 , . . . , a k is an answer to q in I, written I |= q[a 1 , . . . , a k ], if I satisfies q under the mapping π which sets π(x i ) = a i for all 1 ≤ i ≤ k. We call π a match for q in I witnessing a 1 , . . . , a k , written I |= π q. We say that a 1 , . . . , a k is a certain answer to q over K if I |= q[a 1 , . . . , a k ], for all models I of K. We denote the set of all certain answers to q over K with cert(q, K). Rosati in [5] showed that deciding whether a tuple of individuals is a certain answer to q over K is Ptime-complete w.r.t. data-complexity (i.e., w.r.t. the size of the ABox); Ptime-complete w.r.t. KB complexity (i.e., w.r.t. the size of K); and, NP-complete w.r.t. combined complexity (i.e., w.r.t. the size of both K and q).

Datalog

Let N B be a nonempty set of built-in predicates [11]. Then, a Datalog rule r is an expression of the form

S( u) ← S 1 ( u 1 ), . . . , S n ( u n ), B n+1 ( u n+1 ) . . . , B m ( u m ),

where n, m ≥ 0, {S, S 1 , . . . , S n } ⊆ N C ∪ N R , {B n+1 , . . . , B m } ⊆ N B , and u and u i are tuples of terms of suitable length. A rule is safe if each variable occurring in u ∪ u n+1 ∪ . . . ∪ u m also occurs in u 1 ∪ . . . ∪ u n . Atom S( u) is the head of the rule, and atoms S 1 ( u 1 ), . . . , B m ( u m ) constitute the body of the rule. Whenever the body of a rule r is empty, we call r a fact, and we write the rule as S( u). A Datalog program P is a set of safe Datalog rules. Finally, sch(P ) is the set of predicates occurring in P .

Next, we define the semantics of a Datalog program P using Herbrand interpretations [8]. The Herbrand Universe of P is the set of all individuals occurring in P . The Herbrand Base of P is the set of all facts that can be constructed from the predicates in N C ∪ N R and the individuals in the universe of P . A Herbrand interpretation I of P is a subset of the Herbrand Base of P . Note that I does not interpret built-in predicates. As usual, we assume that these predicates are evaluated over a fixed, possibly infinite Herbrand interpretation B [12]. Then I is a model of P w.r.t. B if, for all the rules r in P , we have that

I ∪ B |= ∀ x(B m ( u n ) ∧ . . . ∧ B n+1 ( u n+1 ) ∧ S n ( u n ) ∧ . . . ∧ S 1 ( u 1 ) → S( u)),

where x is a tuple consisting of all variables occurring in the rule. The semantics of a Datalog program P is defined as the minimal Herbrand interpretation I satisfying P w.r.t. B, written M B (P ). Whenever the program does not contain built-in predicates, we do not consider the interpretation B and we simply write M(P ). The semantics of Datalog programs can be defined also by means of a fixpoint construction. Then, T P is the immediate consequence operator that maps instances I over sch(P ) to instances over sch(P ) as follows. For each rule r in P , if there exists a match π for S 1 ( u 1 ) ∧ . . .

∧ S n ( u n ) ∧ B n+1 ( u n+1 ) ∧ . . . ∧ B m ( u m ) in I ∪ B, then S(a 1 , . . . , a k ) is contained in T P (I) with a i = π(u i ) for each u i ∈ u.

One can prove that T P has a minimum fixpoint T ω P such that T ω P = M B (P ) [8]. Finally, a Datalog query is a tuple

Q P , Q C where Q P is a predicate symbol and Q C is a Datalog program. A tuple of individuals a 1 , . . . , a k is an answer to Q P , Q C over Datalog program P if P ∪ Q C |= Q P (a 1 , . . . , a k ).

Datalog Rewriting for EL TBoxes

In this section, we show how to transform an EL TBox T into a Datalog program P T whose size depends linearly on T . The transformation is such that, for an arbitrary EL ABox A, we can use the unraveling of M(P T ∪A) to compute the answers to conjunctive queries over T , A . Let T be a TBox over an arbitrary EL signature. Intuitively, for each axiom α occurring in T , the program P T contains a set of Datalog rules which encode the constraint imposed by α. To achieve this, we have to overcome two issues.

First, EL concept inclusions of the form A ∃R.B require the use of either existential quantifications or Skolem terms in rule heads; however, Datalog does not allow neither of the two. In order to solve this issue, we use a technique that has been introduced for representing canonical models of EL knowledge bases [3]. That is, for each atomic concept B occurring in T we introduce a fresh Second, EL allows for to occur in concept expressions. Hence, we need to define in P T a unary predicate , whose extension-given an ABox A-coincides with the Herbrand universe of P T ∪ A. To achieve this, we restrict our study to a subset of all EL ABoxes. In particular, we consider only those ABoxes A such that Pred(A) ⊆ Pred(T ). That is, each predicate occurring in the ABox A must occur also in the TBox T . Then, in our Datalog program, for each atomic concept A and for each atomic role R occurring in T , we add the following rules:

Axiom α Set of rules Θ(α) B B(X) ← A(X) A1 A2 B B(X) ← A1(X), A2(X) ∃R.A B B(X) ← R(X, Y ), A(Y ) A ∃R.B R(X, oB) ← A(X) B(oB) ← A(X)(X) ← A(X); (X) ← R(X, Y ); (Y ) ← R(X, Y ).

This is only one of the several ways in which we can encode such a predicate. In fact, another possibility would be-as suggested by Rosati in [5]-to assume that each ABox A contains an assertion (a) for each individual a ∈ Ind(A). We believe that in the context of OBDA-where the focus is to provide access to arbitrary data-sources-it is important to make as few assumptions as possible on the physical realization of the ABox. For this reason, we prefer the option presented above. Next, we formalize the transformation of a TBox T into a Datalog program P T . Let Aux = {o A | A ∈ N C } ∪ {o } be a set of auxiliary individuals distinct from N I . Then, the program P T is constructed from terms in N T ∪ Aux and predicates in N C ∪ N R ∪ { } as follows. The transformation uses the function Θ, shown in Figure 1, to transform each axiom in the (normalized) TBox T into a set of Datalog rules. The Datalog program P T is then defined as follows.

P T = α∈T Θ(α) A∈Pred(T )∩N C (X) ← A(X) R∈Pred(T )∩N R (X) ← R(X, Y ), (Y ) ← R(X, Y )

The following result readily follows from the definition of the program.

Proposition 1. For an arbitrary EL TBox T , Datalog program P T can be computed in time linear in the size of T .

an arbitrary EL ABox A. Next, we prove that the unraveling U w.r.t. A of M(P T ∪A) can be used to answer conjunctive queries over K = T , A . We do so in two distinct steps. First, we introduce the notion of chase of an EL knowledge base K. Second, we show that the chase of K is isomorphic to U.

The chase of an EL knowledge base K = T , A , written chase(K), is a possibly infinite Herbrand interpretation defined inductively by starting from A and then applying axioms occurring in the TBox to assertions occurring in the ABox. In our definition of the chase, we use function terms to denote existentially quantified individuals. Hence, the definition of ABox assertion is extended in a natural way to accommodate for assertions over function terms. We denote with u and w terms that can be either individuals or function terms. Next, we define an operator Γ T that chases an ABox by applying the axioms occurring in the TBox T . In the definition, we use assertions of the form (u) to assert that u is a member of the EL concept expression . For S an arbitrary ABox, Γ T (S) is the smallest ABox containing S and closed under the following chasing rules.

(cr1) If {A(u)} ⊆ S and A B ∈ T , then {B(u)} ⊆ Γ T (S). (cr2) If {A 1 (u), A 2 (u)} ⊆ S and A 1 A 2 B ∈ T , then {B(u)} ⊆ Γ T (S). (cr3) If {R(u, w), A(w)} ⊆ S and ∃R.A B ∈ T , then {B(u)} ⊆ Γ T (S). (cr4) If {A(u)} ⊆ S and A ∃R.B ∈ T , then {R(u, f (u, R, B)), B(f (u, R, B))} ⊆ Γ T (S). (cr5) If u occurs in S, then { (u)} ⊆ Γ T (S).

We now define an infinite sequence of finite ABoxes A i for i ∈ N.

A 0 = A A i+1 = Γ T (A i )

Finally, the chase of K is the infinite union of all such ABoxes A i .

chase(K)= i∈N A i

It is clear that our construction of the chase of K is fair. In fact, for each i ∈ N we have that A i+1 is the result of exhaustively applying-to all possible assertions occurring in A i -all applicable axioms in T . At last, we want to point out that chase(K) can be used to compute the certain answers to a CQ q over K.

Proposition 2 ([5]

). Let K be an EL knowledge base. Further, let q be a k-ary conjunctive query. Then, for each k-ary tuple of individuals a 1 , . . . , a k , we have

a 1 , . . . , a k ∈ cert(q, K) if and only if chase(K) |= q[a 1 , . . . , a k ].

So, by proving that the unraveling U of M(P T ∪A) is isomorphic to chase(K), we establish that U can be used to answer conjunctive queries over K. To prove the structural equivalence of U and chase(K), we define a function h mapping paths occurring in U to terms chase(K). We define h by induction on the depth of paths occurring in U as follows.

Base Case. Consider an arbitrary p ∈ ∆ U with dep(p) = 1. We set h(p) := p.

Inductive Step. Let p = t 1 • R 2 • t 2 • • • t n−1 • R n • t n be a path occurring in U such that h(p) has not been defined yet, but h(t 1 • • • R n−1 • t n−1 ) = u.

We distinguish between two cases depending on the type of the individual t n .

1. If t n occurs in the ABox, we set h(p

) := t n . 2. If t n is of the form o B , we set h(p) := f (u, R n , B).

Theorem 1 shows that h is an isomorphism between the two structures. Intuitively, for the only-if direction, we show that h is an injective homomorphism from U to chase(K) by induction on the depth of paths occurring in U; for the if-direction, by induction on the construction of chase(K) we prove that h is a surjective function and that it is a homomorphism from chase(K) to U. Theorem 1. Function h is an isomorphism from U to chase(K).

Since the unraveling of M(P T ∪ A) is generally infinite, this result alone does not provide us with an algorithm for answering queries in EL. In the next section, we show how to rewrite a user query q into a Datalog query

Q P , Q C such that P T ∪ A ∪ Q C |= Q P (a 1 , .

. . , a k ) if and only if a 1 , . . . , a k ∈ cert(q, K) and thus solve the problem.

Polynomial Query Rewriting in Datalog

In the previous section, we have seen that for an arbitrary EL KB K = T , A evaluating a conjunctive query q over the unraveling of the Herbrand model of P T ∪ A is equivalent to computing the certain answers to q over K. In this section, we develop a query rewriting procedure that reduces the computation of cert(q, K) to the problem of evaluating a suitably constructed Datalog query over P T ∪ A. We achieve this in two steps. First, we present an interesting property of a certain class of interpretations. Second, we show how this result can be used to develop a query rewriting procedure in our Datalog setting.

We use the notions of A-connected and split interpretations from [6,13]. Let I be an interpretation and let A be an ABox over an arbitrary EL signature. We say that I is A-connected if, for each domain element c ∈ ∆ I , there exists a path p ∈ paths A (I) such that tail(p) = c. Furthermore, I is a split interpretation if, for all domain elements c, c ∈ ∆ I , we have that c ∈ {a

I | a ∈ Ind(A)} and c, c ∈ R I imply c ∈ {a I | a ∈ Ind(A)}. Intuitively, in an A-connected interpretation I, for each domain element c n it is always possible to find a path a I • R 2 • c 2 • • • R n • c n such that a ∈ Ind(A).

Furthermore, if I is split, then each domain element that is not the image of an individual can be related by a role only with elements that themselves do not interpret individuals.

Then, let J be the unraveling w.r.t. A of a split and A-connected interpretation I and let q be a conjunctive query. Lutz et al. in [6,13] showed that it is possible to reduce the problem of answering q in J to evaluating a first-order query rewriting q * of q over I. Roughly speaking, the query rewriting q * rules some spurious answers for q in I that cannot be reproduced in J . More specifically, we have to ensure that the answer variables of q, the variables of q mapped to cyclic portions of I, and the variables of q mapped to nontree portions of I are all matched only to the domain elements in {a I | a ∈ Ind(A)}.

We now briefly outline how we can construct such an FO rewriting q * for q [6]. Let ∼ q be the smallest equivalence relation over N T (q) that is closed under the following rule: if R(s, t) and R(s , t ) occur in q and t ∼ q t , then s ∼ q s . Then, for each equivalence class ζ of ∼ q , we let t ζ ∈ ζ be an arbitrary, but fixed, representative of the class. Also, for each such equivalence class ζ and for each atomic role R occurring in q, we let Pred(ζ, R) be the following set.

Pred(ζ, R) = {t ∈ N T (q) | R(t, t ) occurs in q with t ∈ ζ}

Next, we define three sets of terms that correspond to the above mentioned cases.

-Fork = is the set of all pairs Pred(ζ, R), t ζ such that ζ is an equivalence class of ∼ q and |Pred(ζ, R)| > 1.

-Fork = is the set of all quantified variables v ∈ qvar(q) for which atoms R(s, v) and S(s , t) exist in q such that R = S and v ∼ q t. -Cyc is the set of all variables v ∈ qvar(q) for which atoms

R 0 (t 0 , t 0 ), . . . , R m (t m , t m ), . . . , R n (t n , t n )

exist in q such that m, n ≥ 0; for some i ≤ n we have that t i ∼ q v; for each j < n we have that t j ∼ q t j+1 ; and t n ∼ q t m . We are now ready to formally specify the FO query rewriting q * . In the definition, we assume that Aux is a fresh predicate not occurring in q and K and that every interpretation I interprets Aux as ∆ I \ {a I | a ∈ Ind(A)}. Then, formulae q 1 and q 2 are defined as follows.

q 1 = v∈avar(q)∪Fork = ∪Cyc ¬Aux(v) q 2 = Pred(ζ,R),t ζ ∈Fork= ¬Aux(t ζ ) ∨ t,t ∈Pred(ζ,R) (t = t )

Finally, we set q * = q∧q 1 ∧q 2 . It turns out that q * can be computed in polynomial time w.r.t. q [6]. In the same paper, Lutz et al. prove the following result. Proposition 3. Let A be an arbitrary EL ABox, let I be a split and A-connected interpretation, and let J be the unraveling of I w.r.t. A. Then, for every k-tuple of individuals a 1 , . . . , a k , we have that

I |= q * [a 1 . . . , a k ] if and only if J |= q[a 1 . . . , a k ].

This result applies to our Datalog rewriting of EL TBoxes. Indeed, for an arbitrary EL KB K = T , A , we have that M(P T ∪A) is a split and A-connected interpretation. The intuition behind the argument is as follows. We show that M(P T ∪ A) is split by noticing that rules encoded in P T do not allow for the derivation of facts of the form R(o B , a) for a ∈ Ind(A) and o B ∈ Aux. To see that M(P T ∪ A) is A-connected, we just recall that M(P T ∪ A) is minimal and, hence, all the derived facts must be "grounded" w.r.t. the facts in A.

Theorem 2. Let K = T A be an EL knowledge base. Then, M(P T ∪ A) is a split and A-connected interpretation.

Hence, for an arbitrary k-ary CQ q and for each k-tuple of individuals a 1 , . . . , a k , we have that M(P T ∪ A) |= q * [a 1 . . . , a k ] if and only if a 1 , . . . , a k ∈ cert(q, K).

Note that q * is a first-order query, and we are unaware of systems capable of evaluating first-order queries over Datalog programs. Therefore, we next show how to transform q * into a Datalog query Q P , Q C such that a 1 , . . . , a k ∈ cert(q, K) if and only if P T ∪ A ∪ Q C |= Q P (a 1 . . . , a k ). We construct such a Datalog query Q P , Q C by applying to q * a simplified version of the Lloyd-Topor transformation [14,15]. Definition 1 (Datalog Rewriting). Let q( x) be a k-ary CQ whose quantified variables are among y; let Cyc, Fork = , and Fork = be as specified above; let

Pred(ζ 1 , R 1 ), t 1 ζ , . . ., Pred(ζ n , R n ), t n

ζ be an arbitrary enumeration of Fork = ; let p 0 , p 1 , . . . , p n be fresh predicates; and let Named be a built-in with a predetermined, possibly infinite Herbrand interpretation N = {Named(a) | a ∈ N I }. Query Q C then contains the following safe Datalog rules:

p 0 ( x, y) ← q, v∈avar(q)∪Fork = ∪Cyc Named(v)(1)p i ( x, y) ← p i−1 ( x, y), Named(t i ζ ) for 1 ≤ i ≤ n(2)p i ( x, y) ← p i−1 ( x, y), t,t ∈Pred(ζ i ,R i ) t = t for 1 ≤ i ≤ n(3)Q P ( x) ← p n ( x, y)(4)

One may think that the recursive definition of predicates p i for 1 ≤ i ≤ n could be simplified by writing Q P ( x) ← p 0 ( x, y) . . . p n ( x, y) and by defining each p i as:

p i ( x, y) ← Named(t i ζ ) p i ( x, y) ← t,t ∈Pred(ζ i ,R i ) t = t

Unfortunately, these rules are not safe. Safe rules, on the one hand, provide us with a clear and unambiguous semantics. On the other hand, unsafe rules are also computationally more expensive for bottom-up computation, since each variable in the head may be bound to an arbitrary individual in the universe of the program. For this reason, we prefer our, slightly more involved, solution.

Proposition 4. For an arbitrary k-ary conjunctive query q, query Q P , Q C can be computed in polynomial time w.r.t. the size of q.

Proof. We note that ∼ q can be computed in polynomial time w.r.t. the size of q [6] and, therefore, also the sets Cyc, Fork = , and Fork = can be computed in polynomial time w.r.t. q. Furthermore, the size of the body of rule p 0 ( x, y) depends linearly on the size of q, Cyc, and Fork = . Also, for each pair Pred(ζ, R), t ζ in Fork = , the program Q C contains exactly two rules. The size of these two rules depends linearly on the size of Pred(ζ, R), t ζ . Thus, we conclude that Q P , Q C can be computed in polynomial time with respect to the size of q.

[10], we prove that our rewriting is correct-that is, that answering Q P , Q C over P T ∪ A is equivalent to computing cert(q, T , A). This follows from Proposition 3 and the fact that our Datalog query is the result of transforming the FO rewriting q * along the lines of the Lloyd-Topor transformation. Theorem 3. Let K be an EL knowledge base and let q be a k-ary CQ over K. Then, for every k-tuple of individuals a 1 , . . . , a k , we have that a 1 , . . . , a k ∈ cert(q, K) if and only if

P T ∪ A ∪ Q C |= Q P (a 1 , . . . , a k ).

Finally, we investigate the complexity of our rewriting procedure. Theorem 4. Let K = T , A be an EL KB, let q be a k-ary CQ, and let a 1 , . . . , a k be a tuple of individuals. We can decide P T ∪A∪Q C |= Q P (a 1 , . . . , a k ) in polynomial time w.r.t. the size of K and in non-deterministic polynomial time with respect to the size of both K and q.

Proof. We have already argued that the size of Datalog program P T depends linearly on the size of the TBox T and that the Datalog rewriting Q P , Q C can be computed in Ptime w.r.t. q. Also, we note that the arity of predicates and the number of variables occurring in P T ∪ A ∪ Q C do not depend on K. Finally, from an implementation point-of-view (as suggested in [12]), the built-in predicate Named can be considered as an assertion in the ABox A with a different physical realization: it is not directly stored in the ABox but it is implemented as a procedure which is evaluated during the execution of the program. Clearly, such a procedure can be implemented to run in time polynomial in K. It follows that we can compute the minimal Herbrand model of P T ∪ A ∪ Q C in time polynomial in the size of K [8]. The membership in NP follows directly from the considerations above and from the fact that we can guess and check in nondeterministic polynomial time a match π for Q P in M(P T ∪ A ∪ Q C ).

Conclusions

In this paper, we introduce a query rewriting approach to query answering in EL. In our approach, the process of computing the certain answers to a CQ q over an EL KB K = T , A is divided into two distinct steps. A first preprocessing step in which the TBox T is transformed into a Datalog program P T , whose size is linear in T . Then, at query time, the query q is independently rewritten into a Datalog query Q P , Q C , whose size is polynomial in q. Finally, computing cert(q, K) amounts to evaluating the Datalog query Q P , Q C over P T ∪ A.

In future, we plan to extend our approach to deal with ELH dr ⊥ . Lutz et al. have already proposed a combined approach for query answering in this DL [13]. However, differently from their solution, we would like the Datalog rewriting Q P , Q C to be independent from the role inclusions contained in the TBox. Additionally, we plan to extend our work to cover nominals, which raises the question on how to efficiently handle equality in Datalog [2].

Fig. 1 .1Fig. 1. Transformation of EL Axioms into Rules.

We say that a path p has depth n and we write dep(p) = n; furthermore, tail(p) is the last domain element c n in p. Let paths A (I) denote the set of all paths w.r.t. A occurring in I. The unraveling J of I w.r.t. A is the following interpretation.

Ontologies and databases: the DL-Lite approach DCalvanese De Giacomo DLembo MLenzerini APoggi MRodríguez-Muro RRosati Semantic Technologies for Informations Systems -5th Int. Reasoning Web Summer School (RW STessaris EFranconi 2009. 2009 5689 Tractable query answering and rewriting under description logic constraints HPérez-Urbina BMotik IHorrocks J. Applied Logic 8 2 2010 Pushing the EL envelope FBaader SBrandt CLutz IJCAI, Professional Book Center LPKaelbling ASaffiotti 2005 Pushing the EL envelope further FBaader SBrandt CLutz Proceedings of the OWLED 2008 DC Workshop on OWL: Experiences and Directions KClark PFPatel-Schneider the OWLED 2008 DC Workshop on OWL: Experiences and Directions 2008 On conjunctive query answering in EL RRosati Description Logics CEUR Workshop Proceedings DCalvanese EFranconi VHaarslev DLembo BMotik AYTurhan STessaris 2007 250 Conjunctive query answering in EL using a database system CLutz DToman FWolter Proceedings of the OWLED 2008 Workshop on OWL: Experiences and Directions the OWLED 2008 Workshop on OWL: Experiences and Directions 2008 The combined approach to ontology-based data access RKontchakov CLutz DToman FWolter MZakharyaschev 2011 IJCAI, AAAI Press SAbiteboul RHull VVianu Foundations of databases Addison-Wesley 1995 Query evaluation under the well-founded semantics WChen DSWarren Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. PODS '93 the twelfth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems. PODS '93

New York, NY, USA

ACM 1993 Small datalog query rewriting for EL GStefanoni BMotik IHorrocks 2012 University of Oxford Technical report JDUllman Principles of database and knowledge-base systems

New York, NY, USA

Computer Science Press, Inc 1988 I What you always wanted to know about datalog (and never dared to ask) SCeri GGottlob LTanca IEEE Trans. Knowl. Data Eng 1 1 1989 Conjunctive query answering in the description logic EL using a relational database system CLutz DToman FWolter IJCAI CBoutilier 2009 Making prolog more expressive JLloyd RTopor The Journal of Logic Programming 1 3 1984 Semantic web methods for knowledge managmement SDecker 2002 University of Karlsruhe PhD thesis