Small Datalog Query Rewritings for EL?

                Giorgio Stefanoni, Boris Motik, and Ian Horrocks

               Department of Computer Science, University of Oxford


1     Introduction
Description Logics are a key technology in data management scenarios such as
Ontology-Based Data Access (OBDA), a paradigm in which a DL ontology is
used to provide a conceptual view of the data [1]. An OBDA system transforms
a conjunctive query over the ontology into a query over the data sources [2]. This
transformation is independent of the data, so the OBDA approach can thus be
used in settings where the data sources provide read-only access to the data, and
where the data changes frequently.
    Most existing OBDA systems are based on the DL-Lite family of lightweight
Description Logics [1], which is also the basis for the QL profile of the OWL 2
ontology language. Logics in this family have been designed to allow a conjunc-
tive query posed over the ontology to be rewritten as a first order query over
the data sources—that is, queries are first-order rewritable. The query rewriting
procedure is independent of the data, and the resulting queries can be evaluated
using highly scalable relational database technology. To achieve this, however,
the expressive power of DL-Lite is very restrictive. This prevents the OBDA
approach from being applied in the life science domain, where many ontologies
use DLs from the EL family [3, 4]. This family provides the basis for the EL
profile of OWL2, and many prominent ontologies, such as SNOMED-CT, were
developed using this language.
    The problem of answering conjunctive queries in EL has already been studied
in the literature, and two orthogonal approaches have been proposed. First,
Rosati proposed a pure query rewriting technique which transforms an EL TBox
T and a conjunctive query q into a Datalog program PT ,q [5]. Second, Lutz
et al. introduced a “combined” approach [6, 7]. This technique first materializes
certain facts entailed by the ontology in a precomputation step. Then, each user
query is rewritten into a polynomial first-order query that, when evaluated over
the materialized facts, computes the answers to the user’s query.
    Unfortunately, these two approaches exhibit several shortcomings when ap-
plied in the context of OBDA. In particular, Rosati’s rewriting technique com-
putes for each user query a fresh Datalog program whose size depends on both
the query and the terminology, which could be very inefficient when dealing with
large scale ontologies. The approach by Lutz et al. produces smaller first order
rewritings, but the use of materialization means that the technique is only appli-
cable when the data sources provide read/write access to the data; furthermore,
materialization can be inefficient if the data changes frequently.
?
    This work was supported by EPSRC and Alcatel-Lucent.
    In this paper, we present a pure query rewriting technique to query answering
in EL. Our approach reinterprets the combined approach proposed by Lutz and
colleagues in terms of Datalog. Our rewriting procedure consists of two distinct
steps. The first step rewrites a TBox T into a Datalog program PT , whose size
depends linearly on the size of T . Then, at query time, the conjunctive query q
is rewritten into a Datalog query hQP , QC i, whose size depends polynomially
on q. The two rewriting steps are such that, given an ABox A, deciding whether
QP (a1 , . . . , ak ) follows from PT ∪ QC ∪ A is equivalent to deciding whether
ha1 , . . . , ak i is a certain answer to q over a knowledge base hT , Ai.
    At last, we summarize our main contributions. First, our rewriting approach,
unlike Rosati’s, separates the rewriting of the TBox and the query into two dis-
tinct steps, thus reducing inefficiency when dealing with large ontologies. Second,
our technique does not require the materialization of entailed facts, hence our
solution is in the spirit of OBDA and it avoids the problems associated with
the materialization of large models. Finally, we set the stage for assessing the
utility and the applicability to PT ∪ QC ∪ A of optimized Datalog evaluation
techniques, such as magic sets and SLG resolution [8, 9]. Indeed, heuristic-based
evaluation strategies significantly reduce the number of facts needed to answer
a query, thus potentially improving the performance of our rewriting approach.
In this paper we provide only proof sketches, and refer the reader to [10] for full
proofs.


2    Preliminaries

Description Logic EL

Let NC , NR , NI be pairwise disjoint infinite sets of atomic concepts, atomic roles,
and individuals. Together, the sets NC , NR , and NI form the signature of an EL
language. Whenever the distinction between atomic concepts and atomic roles
is immaterial, we call an element of NC ∪ NR a predicate. The set of EL concept
expressions is inductively defined starting from atomic concepts A ∈ NC and
atomic roles R ∈ NR according to the syntax rule: C → A | C1 u C2 | ∃R.C | >.
    An EL TBox T is a finite set of concept inclusions of the form C v D; an
EL ABox A is a finite set of assertions of the form A(a) or R(a, b) with a and
b individuals; and an EL knowledge base (KB) is a tuple K = hT , Ai, where
T is an EL TBox and A is an EL ABox. We denote with Ind(A) the set of all
individuals occurring in the ABox A. Furthermore, for E either a TBox or an
ABox, Pred(E) is the set of all predicates occurring in E.
    Semantics is given as usual in terms of first-order interpretations I = h∆I , ·I i,
where ∆I is a nonempty domain and ·I is an interpretation function; please re-
fer to [3] for details. In the following, we will extensively use the notion of an
unraveling of an interpretation w.r.t. an ABox. Consider an interpretation I and
an ABox A over an arbitrary EL signature. A path p in I w.r.t. A is a nonempty
finite sequence c1 · R2 · c2 · · · Rn−1 · cn−1 · Rn · cn such that c1 ∈ {aI | a ∈ Ind(A)}
                                                              I
and for all 1 ≤ i ≤ n − 1 we have that hci , ci+1 i ∈ Ri+1        for Ri+1 ∈ NR . We say
that a path p has depth n and we write dep(p) = n; furthermore, tail(p) is the
last domain element cn in p. Let pathsA (I) denote the set of all paths w.r.t. A
occurring in I. The unraveling J of I w.r.t. A is the following interpretation.

  ∆J = pathsA (I)
  aJ = aI
  AJ = {p ∈ pathsA (I) | tail(p) ∈ AI }
  RJ = {haJ , bJ i | R(a, b) ∈ A} ∪ {hp, p · R · ci | {p, p · R · c} ⊆ pathsA (I)}

In this paper, we deal only with normalized EL TBoxes. Let A1 , A, and B be
arbitrary concepts from NC ∪ {>}. We say that an EL TBox T is in normal
form if each axiom in T is in one of the following forms: A v B, A u A1 v B,
A v ∃R.B, or ∃R.A v B. Given an arbitrary EL TBox T , we can compute a
normalized TBox Tnorm of T in linear time [3].


Querying EL KBs

 Let NV be an infinite set of variables disjoint from NI . Together, NV and NI form
 the set NT of terms. A first-order query q is a first-order formula constructed
 from the terms in NT and the predicates from NC ∪ NR [8]. In general, we
 write q = ψ(~x) to express that q is the FO formula ψ whose answer variables are
~x = {x1 , . . . , xk }. A query with k answer variables is a k-ary query. A conjunctive
 query (CQ) is a FO query of the form q = ∃~y .ψ(~x, ~y ), where ψ is a conjunction
 of unary atoms A(s) and binary atoms R(s, t) with s and t terms. The variables
~y are the quantified variables of q. In the following, avar(q) is the set of answer
 variables of q, and qvar(q) is the set of quantified variables. Finally, NV (q) is the
 set of all variables occurring in q, and NT (q) is the set of all terms occurring
 in q. Let q = ψ(~x) be a k-ary FO query with ~x = hx1 , . . . , xk i and let I be an
 interpretation. We say that a k-ary tuple of individuals ha1 , . . . , ak i is an answer
 to q in I, written I |= q[a1 , . . . , ak ], if I satisfies q under the mapping π which
 sets π(xi ) = ai for all 1 ≤ i ≤ k. We call π a match for q in I witnessing
 ha1 , . . . , ak i, written I |=π q. We say that ha1 , . . . , ak i is a certain answer to
 q over K if I |= q[a1 , . . . , ak ], for all models I of K. We denote the set of all
 certain answers to q over K with cert(q, K). Rosati in [5] showed that deciding
 whether a tuple of individuals is a certain answer to q over K is Ptime-complete
 w.r.t. data-complexity (i.e., w.r.t. the size of the ABox); Ptime-complete w.r.t.
 KB complexity (i.e., w.r.t. the size of K); and, NP-complete w.r.t. combined
 complexity (i.e., w.r.t. the size of both K and q).


Datalog

Let NB be a nonempty set of built-in predicates [11]. Then, a Datalog rule r
is an expression of the form

               S(~u) ← S1 (~u1 ), . . . , Sn (~un ), Bn+1 (~un+1 ) . . . , Bm (~um ),
 where n, m ≥ 0, {S, S1 , . . . , Sn } ⊆ NC ∪ NR , {Bn+1 , . . . , Bm } ⊆ NB , and ~u and
~ui are tuples of terms of suitable length. A rule is safe if each variable occurring
 in ~u ∪ ~un+1 ∪ . . . ∪ ~um also occurs in ~u1 ∪ . . . ∪ ~un . Atom S(~u) is the head of the
 rule, and atoms S1 (~u1 ), . . . , Bm (~um ) constitute the body of the rule. Whenever
 the body of a rule r is empty, we call r a fact, and we write the rule as S(~u). A
 Datalog program P is a set of safe Datalog rules. Finally, sch(P ) is the set
 of predicates occurring in P .
      Next, we define the semantics of a Datalog program P using Herbrand
 interpretations [8]. The Herbrand Universe of P is the set of all individuals
 occurring in P . The Herbrand Base of P is the set of all facts that can be
 constructed from the predicates in NC ∪ NR and the individuals in the universe
 of P . A Herbrand interpretation I of P is a subset of the Herbrand Base of P .
 Note that I does not interpret built-in predicates. As usual, we assume that these
 predicates are evaluated over a fixed, possibly infinite Herbrand interpretation
 B [12]. Then I is a model of P w.r.t. B if, for all the rules r in P , we have that

    I ∪ B |= ∀~x(Bm (~un ) ∧ . . . ∧ Bn+1 (~un+1 ) ∧ Sn (~un ) ∧ . . . ∧ S1 (~u1 ) → S(~u)),

where ~x is a tuple consisting of all variables occurring in the rule. The semantics
of a Datalog program P is defined as the minimal Herbrand interpretation I
satisfying P w.r.t. B, written MB (P ). Whenever the program does not contain
built-in predicates, we do not consider the interpretation B and we simply write
M(P ). The semantics of Datalog programs can be defined also by means of a
fixpoint construction. Then, TP is the immediate consequence operator that maps
instances I over sch(P ) to instances over sch(P ) as follows. For each rule r in P ,
if there exists a match π for S1 (~u1 ) ∧ . . . ∧ Sn (~un ) ∧ Bn+1 (~un+1 ) ∧ . . . ∧ Bm (~um )
in I ∪ B, then S(a1 , . . . , ak ) is contained in TP (I) with ai = π(ui ) for each ui ∈ ~u.
One can prove that TP has a minimum fixpoint TPω such that TPω = MB (P ) [8].
    Finally, a Datalog query is a tuple hQP , QC i where QP is a predicate symbol
and QC is a Datalog program. A tuple of individuals ha1 , . . . , ak i is an answer
to hQP , QC i over Datalog program P if P ∪ QC |= QP (a1 , . . . , ak ).


3    Datalog Rewriting for EL TBoxes
In this section, we show how to transform an EL TBox T into a Datalog
program PT whose size depends linearly on T . The transformation is such that,
for an arbitrary EL ABox A, we can use the unraveling of M(PT ∪A) to compute
the answers to conjunctive queries over hT , Ai. Let T be a TBox over an arbitrary
EL signature. Intuitively, for each axiom α occurring in T , the program PT
contains a set of Datalog rules which encode the constraint imposed by α. To
achieve this, we have to overcome two issues.
    First, EL concept inclusions of the form A v ∃R.B require the use of either
existential quantifications or Skolem terms in rule heads; however, Datalog
does not allow neither of the two. In order to solve this issue, we use a technique
that has been introduced for representing canonical models of EL knowledge
bases [3]. That is, for each atomic concept B occurring in T we introduce a fresh
                    Axiom α               Set of rules Θ(α)
                    AvB                     B(X) ← A(X)
                    A1 u A2 v B      B(X) ← A1 (X), A2 (X)
                    ∃R.A v B         B(X) ← R(X, Y ), A(Y )
                                         R(X, oB ) ← A(X)
                    A v ∃R.B
                                           B(oB ) ← A(X)


                 Fig. 1. Transformation of EL Axioms into Rules.


auxiliary individual oB , which represents the class of existentially quantified
individuals of type B. Then, for each axiom of the above form, the program PT
contains the following two rules:
                    R(X, oB ) ← A(X);       B(oB ) ← A(X).
    Second, EL allows for > to occur in concept expressions. Hence, we need to
define in PT a unary predicate >, whose extension—given an ABox A—coincides
with the Herbrand universe of PT ∪ A. To achieve this, we restrict our study
to a subset of all EL ABoxes. In particular, we consider only those ABoxes A
such that Pred(A) ⊆ Pred(T ). That is, each predicate occurring in the ABox A
must occur also in the TBox T . Then, in our Datalog program, for each atomic
concept A and for each atomic role R occurring in T , we add the following rules:
         >(X) ← A(X);        >(X) ← R(X, Y );        >(Y ) ← R(X, Y ).
This is only one of the several ways in which we can encode such a predicate.
In fact, another possibility would be—as suggested by Rosati in [5]—to assume
that each ABox A contains an assertion >(a) for each individual a ∈ Ind(A). We
believe that in the context of OBDA—where the focus is to provide access to
arbitrary data-sources—it is important to make as few assumptions as possible
on the physical realization of the ABox. For this reason, we prefer the option
presented above.
    Next, we formalize the transformation of a TBox T into a Datalog program
PT . Let Aux = {oA | A ∈ NC } ∪ {o> } be a set of auxiliary individuals distinct
from NI . Then, the program PT is constructed from terms in NT ∪ Aux and
predicates in NC ∪ NR ∪ {>} as follows. The transformation uses the function
Θ, shown in Figure 1, to transform each axiom in the (normalized) TBox T into
a set of Datalog rules. The Datalog program PT is then defined as follows.
                 S
           PT = Sα∈T             Θ(α)
                   A∈Pred(T )∩NC
                                 >(X) ← A(X)
                 S
                   R∈Pred(T )∩NR >(X) ← R(X, Y ), >(Y ) ← R(X, Y )

The following result readily follows from the definition of the program.
Proposition 1. For an arbitrary EL TBox T , Datalog program PT can be
computed in time linear in the size of T .
   Consider an arbitrary EL ABox A. Next, we prove that the unraveling U
w.r.t. A of M(PT ∪A) can be used to answer conjunctive queries over K = hT , Ai.
We do so in two distinct steps. First, we introduce the notion of chase of an EL
knowledge base K. Second, we show that the chase of K is isomorphic to U.
   The chase of an EL knowledge base K = hT , Ai, written chase(K), is a
possibly infinite Herbrand interpretation defined inductively by starting from A
and then applying axioms occurring in the TBox to assertions occurring in the
ABox. In our definition of the chase, we use function terms to denote existentially
quantified individuals. Hence, the definition of ABox assertion is extended in a
natural way to accommodate for assertions over function terms. We denote with
u and w terms that can be either individuals or function terms. Next, we define
an operator ΓT that chases an ABox by applying the axioms occurring in the
TBox T . In the definition, we use assertions of the form >(u) to assert that u is
a member of the EL concept expression >. For S an arbitrary ABox, ΓT (S) is
the smallest ABox containing S and closed under the following chasing rules.

(cr1) If {A(u)} ⊆ S and A v B ∈ T , then {B(u)} ⊆ ΓT (S).
(cr2) If {A1 (u), A2 (u)} ⊆ S and A1 u A2 v B ∈ T , then {B(u)} ⊆ ΓT (S).
(cr3) If {R(u, w), A(w)} ⊆ S and ∃R.A v B ∈ T , then {B(u)} ⊆ ΓT (S).
(cr4) If {A(u)} ⊆ S and A v ∃R.B ∈ T , then

                         {R(u, f (u, R, B)), B(f (u, R, B))} ⊆ ΓT (S).

(cr5) If u occurs in S, then {>(u)} ⊆ ΓT (S).

We now define an infinite sequence of finite ABoxes Ai for i ∈ N.

                                      A0 = A
                                      Ai+1 = ΓT (Ai )

Finally, the chase of K is the infinite union of all such ABoxes Ai .
                                             [
                                chase(K) =       Ai
                                                    i∈N

It is clear that our construction of the chase of K is fair. In fact, for each i ∈ N we
have that Ai+1 is the result of exhaustively applying—to all possible assertions
occurring in Ai —all applicable axioms in T . At last, we want to point out that
chase(K) can be used to compute the certain answers to a CQ q over K.

Proposition 2 ([5]). Let K be an EL knowledge base. Further, let q be a k-ary
conjunctive query. Then, for each k-ary tuple of individuals ha1 , . . . , ak i, we have

         ha1 , . . . , ak i ∈ cert(q, K) if and only if chase(K) |= q[a1 , . . . , ak ].

   So, by proving that the unraveling U of M(PT ∪A) is isomorphic to chase(K),
we establish that U can be used to answer conjunctive queries over K. To prove
the structural equivalence of U and chase(K), we define a function h mapping
paths occurring in U to terms in chase(K). We define h by induction on the
depth of paths occurring in U as follows.
    Base Case. Consider an arbitrary p ∈ ∆U with dep(p) = 1. We set h(p) := p.
    Inductive Step. Let p = t1 · R2 · t2 · · · tn−1 · Rn · tn be a path occurring
in U such that h(p) has not been defined yet, but h(t1 · · · Rn−1 · tn−1 ) = u. We
distinguish between two cases depending on the type of the individual tn .
 1. If tn occurs in the ABox, we set h(p) := tn .
 2. If tn is of the form oB , we set h(p) := f (u, Rn , B).
    Theorem 1 shows that h is an isomorphism between the two structures. In-
tuitively, for the only-if direction, we show that h is an injective homomorphism
from U to chase(K) by induction on the depth of paths occurring in U; for the
if-direction, by induction on the construction of chase(K) we prove that h is a
surjective function and that it is a homomorphism from chase(K) to U.

Theorem 1. Function h is an isomorphism from U to chase(K).

   Since the unraveling of M(PT ∪ A) is generally infinite, this result alone does
not provide us with an algorithm for answering queries in EL. In the next section,
we show how to rewrite a user query q into a Datalog query hQP , QC i such
that PT ∪ A ∪ QC |= QP (a1 , . . . , ak ) if and only if ha1 , . . . , ak i ∈ cert(q, K) and
thus solve the problem.


4    Polynomial Query Rewriting in Datalog
In the previous section, we have seen that for an arbitrary EL KB K = hT , Ai
evaluating a conjunctive query q over the unraveling of the Herbrand model
of PT ∪ A is equivalent to computing the certain answers to q over K. In this
section, we develop a query rewriting procedure that reduces the computation of
cert(q, K) to the problem of evaluating a suitably constructed Datalog query
over PT ∪ A. We achieve this in two steps. First, we present an interesting
property of a certain class of interpretations. Second, we show how this result
can be used to develop a query rewriting procedure in our Datalog setting.
     We use the notions of A-connected and split interpretations from [6, 13]. Let
I be an interpretation and let A be an ABox over an arbitrary EL signature.
We say that I is A-connected if, for each domain element c ∈ ∆I , there exists a
path p ∈ pathsA (I) such that tail(p) = c. Furthermore, I is a split interpretation
if, for all domain elements c, c0 ∈ ∆I , we have that c 6∈ {aI | a ∈ Ind(A)}
and hc, c0 i ∈ RI imply c0 6∈ {aI | a ∈ Ind(A)}. Intuitively, in an A-connected
interpretation I, for each domain element cn it is always possible to find a path
aI · R2 · c2 · · · Rn · cn such that a ∈ Ind(A). Furthermore, if I is split, then each
domain element that is not the image of an individual can be related by a role
only with elements that themselves do not interpret individuals.
     Then, let J be the unraveling w.r.t. A of a split and A-connected interpre-
tation I and let q be a conjunctive query. Lutz et al. in [6, 13] showed that it is
possible to reduce the problem of answering q in J to evaluating a first-order
query rewriting q ∗ of q over I. Roughly speaking, the query rewriting q ∗ rules
out some spurious answers for q in I that cannot be reproduced in J . More
specifically, we have to ensure that the answer variables of q, the variables of
q mapped to cyclic portions of I, and the variables of q mapped to nontree
portions of I are all matched only to the domain elements in {aI | a ∈ Ind(A)}.
    We now briefly outline how we can construct such an FO rewriting q ∗ for
q [6]. Let ∼q be the smallest equivalence relation over NT (q) that is closed under
the following rule: if R(s, t) and R(s0 , t0 ) occur in q and t ∼q t0 , then s ∼q s0 .
Then, for each equivalence class ζ of ∼q , we let tζ ∈ ζ be an arbitrary, but fixed,
representative of the class. Also, for each such equivalence class ζ and for each
atomic role R occurring in q, we let Pred(ζ, R) be the following set.
            Pred(ζ, R) = {t ∈ NT (q) | R(t, t0 ) occurs in q with t0 ∈ ζ}
Next, we define three sets of terms that correspond to the above mentioned cases.
– Fork= is the set of all pairs hPred(ζ, R), tζ i such that ζ is an equivalence class
  of ∼q and |Pred(ζ, R)| > 1.
– Fork6= is the set of all quantified variables v ∈ qvar(q) for which atoms R(s, v)
  and S(s0 , t) exist in q such that R 6= S and v ∼q t.
– Cyc is the set of all variables v ∈ qvar(q) for which atoms
                       R0 (t0 , t00 ), . . . , Rm (tm , t0m ), . . . , Rn (tn , t0n )
   exist in q such that m, n ≥ 0; for some i ≤ n we have that ti ∼q v; for each
   j < n we have that t0j ∼q tj+1 ; and t0n ∼q tm .
We are now ready to formally specify the FO query rewriting q ∗ . In the definition,
we assume that Aux is a fresh predicate not occurring in q and K and that every
interpretation I interprets Aux as ∆I \ {aI | a ∈ Ind(A)}. Then, formulae q1 and
q2 are defined as follows.
                           ^
               q1 =                   ¬Aux(v)
                     v∈avar(q)∪Fork6= ∪Cyc
                               ^                                      ^
              q2 =                            ¬Aux(tζ ) ∨                        (t = t0 )
                     hPred(ζ,R),tζ i∈Fork=                     t,t0 ∈Pred(ζ,R)

Finally, we set q ∗ = q∧q1 ∧q2 . It turns out that q ∗ can be computed in polynomial
time w.r.t. q [6]. In the same paper, Lutz et al. prove the following result.
Proposition 3. Let A be an arbitrary EL ABox, let I be a split and A-connected
interpretation, and let J be the unraveling of I w.r.t. A. Then, for every k-tuple
of individuals ha1 , . . . , ak i, we have that
                I |= q ∗ [a1 . . . , ak ] if and only if J |= q[a1 . . . , ak ].
    This result applies to our Datalog rewriting of EL TBoxes. Indeed, for an
arbitrary EL KB K = hT , Ai, we have that M(PT ∪A) is a split and A-connected
interpretation. The intuition behind the argument is as follows. We show that
M(PT ∪ A) is split by noticing that rules encoded in PT do not allow for the
derivation of facts of the form R(oB , a) for a ∈ Ind(A) and oB ∈ Aux. To see
that M(PT ∪ A) is A-connected, we just recall that M(PT ∪ A) is minimal and,
hence, all the derived facts must be “grounded” w.r.t. the facts in A.
Theorem 2. Let K = hT , Ai be an EL knowledge base. Then, M(PT ∪ A) is a
split and A-connected interpretation.
Hence, for an arbitrary k-ary CQ q and for each k-tuple of individuals ha1 , . . . , ak i,
we have that M(PT ∪ A) |= q ∗ [a1 . . . , ak ] if and only if ha1 , . . . , ak i ∈ cert(q, K).
    Note that q ∗ is a first-order query, and we are unaware of systems capa-
ble of evaluating first-order queries over Datalog programs. Therefore, we
next show how to transform q ∗ into a Datalog query hQP , QC i such that
ha1 , . . . , ak i ∈ cert(q, K) if and only if PT ∪ A ∪ QC |= QP (a1 . . . , ak ). We con-
struct such a Datalog query hQP , QC i by applying to q ∗ a simplified version
of the Lloyd-Topor transformation [14, 15].
Definition 1 (Datalog Rewriting). Let q(~x) be a k-ary CQ whose quanti-
fied variables are among ~y ; let Cyc, Fork6= , and Fork= be as specified above; let
hPred(ζ 1 , R1 ), t1ζ i, . . ., hPred(ζ n , Rn ), tnζ i be an arbitrary enumeration of Fork= ;
let p0 , p1 , . . . , pn be fresh predicates; and let Named be a built-in with a prede-
termined, possibly infinite Herbrand interpretation N = {Named(a) | a ∈ NI }.
Query QC then contains the following safe Datalog rules:
                                      ^
            p0 (~x, ~y ) ← q,                         Named(v)                            (1)
                            v∈avar(q)∪Fork6= ∪Cyc

          pi (~x, ~y ) ← pi−1 (~x, ~y ), Named(tiζ )                       for 1 ≤ i ≤ n   (2)

                                                   ^
          pi (~x, ~y ) ← pi−1 (~x, ~y ),                          t = t0   for 1 ≤ i ≤ n   (3)
                                           t,t0 ∈Pred(ζ i ,Ri )

           QP (~x) ← pn (~x, ~y )                                                          (4)
One may think that the recursive definition of predicates pi for 1 ≤ i ≤ n could
be simplified by writing QP (~x) ← p0 (~x, ~y ) . . . pn (~x, ~y ) and by defining each pi as:
            pi (~x, ~y ) ← Named(tiζ )  pi (~x, ~y ) ← t,t0 ∈Pred(ζ i ,Ri ) t = t0
                                                          V

Unfortunately, these rules are not safe. Safe rules, on the one hand, provide
us with a clear and unambiguous semantics. On the other hand, unsafe rules
are also computationally more expensive for bottom-up computation, since each
variable in the head may be bound to an arbitrary individual in the universe of
the program. For this reason, we prefer our, slightly more involved, solution.
Proposition 4. For an arbitrary k-ary conjunctive query q, query hQP , QC i
can be computed in polynomial time w.r.t. the size of q.
Proof. We note that ∼q can be computed in polynomial time w.r.t. the size of
q [6] and, therefore, also the sets Cyc, Fork6= , and Fork= can be computed in poly-
nomial time w.r.t. q. Furthermore, the size of the body of rule p0 (~x, ~y ) depends
linearly on the size of q, Cyc, and Fork6= . Also, for each pair hPred(ζ, R), tζ i in
Fork= , the program QC contains exactly two rules. The size of these two rules
depends linearly on the size of hPred(ζ, R), tζ i. Thus, we conclude that hQP , QC i
can be computed in polynomial time with respect to the size of q.                   t
                                                                                    u
    In [10], we prove that our rewriting is correct—that is, that answering hQP , QC i
over PT ∪ A is equivalent to computing cert(q, T , A). This follows from Propo-
sition 3 and the fact that our Datalog query is the result of transforming the
FO rewriting q ∗ along the lines of the Lloyd-Topor transformation.

Theorem 3. Let K be an EL knowledge base and let q be a k-ary CQ over K.
Then, for every k-tuple of individuals ha1 , . . . , ak i, we have that

     ha1 , . . . , ak i ∈ cert(q, K) if and only if PT ∪ A ∪ QC |= QP (a1 , . . . , ak ).

Finally, we investigate the complexity of our rewriting procedure.

Theorem 4. Let K = hT , Ai be an EL KB, let q be a k-ary CQ, and let
ha1 , . . . , ak i be a tuple of individuals. We can decide PT ∪A∪QC |= QP (a1 , . . . , ak )
in polynomial time w.r.t. the size of K and in non-deterministic polynomial time
with respect to the size of both K and q.

Proof. We have already argued that the size of Datalog program PT depends
linearly on the size of the TBox T and that the Datalog rewriting hQP , QC i
can be computed in Ptime w.r.t. q. Also, we note that the arity of predicates
and the number of variables occurring in PT ∪ A ∪ QC do not depend on K.
Finally, from an implementation point-of-view (as suggested in [12]), the built-in
predicate Named can be considered as an assertion in the ABox A with a different
physical realization: it is not directly stored in the ABox but it is implemented
as a procedure which is evaluated during the execution of the program. Clearly,
such a procedure can be implemented to run in time polynomial in K. It follows
that we can compute the minimal Herbrand model of PT ∪ A ∪ QC in time
polynomial in the size of K [8]. The membership in NP follows directly from
the considerations above and from the fact that we can guess and check in
nondeterministic polynomial time a match π for QP in M(PT ∪ A ∪ QC ).            t
                                                                                 u


5    Conclusions

In this paper, we introduce a query rewriting approach to query answering in EL.
In our approach, the process of computing the certain answers to a CQ q over an
EL KB K = hT , Ai is divided into two distinct steps. A first preprocessing step
in which the TBox T is transformed into a Datalog program PT , whose size
is linear in T . Then, at query time, the query q is independently rewritten into
a Datalog query hQP , QC i, whose size is polynomial in q. Finally, computing
cert(q, K) amounts to evaluating the Datalog query hQP , QC i over PT ∪ A.
                                                                   dr
     In future, we plan to extend our approach to deal with ELH⊥      . Lutz et al.
have already proposed a combined approach for query answering in this DL [13].
However, differently from their solution, we would like the Datalog rewriting
hQP , QC i to be independent from the role inclusions contained in the TBox.
Additionally, we plan to extend our work to cover nominals, which raises the
question on how to efficiently handle equality in Datalog [2].
References
 1. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., Rodrı́guez-
    Muro, M., Rosati, R.: Ontologies and databases: the DL-Lite approach. In Tessaris,
    S., Franconi, E., eds.: Semantic Technologies for Informations Systems – 5th Int.
    Reasoning Web Summer School (RW 2009). Volume 5689. (2009) 255–356
 2. Pérez-Urbina, H., Motik, B., Horrocks, I.: Tractable query answering and rewriting
    under description logic constraints. J. Applied Logic 8(2) (2010) 186–209
 3. Baader, F., Brandt, S., Lutz, C.: Pushing the EL envelope. In Kaelbling, L.P.,
    Saffiotti, A., eds.: IJCAI, Professional Book Center (2005) 364–369
 4. Baader, F., Brandt, S., Lutz, C.: Pushing the EL envelope further. In Clark, K.,
    Patel-Schneider, P.F., eds.: In Proceedings of the OWLED 2008 DC Workshop on
    OWL: Experiences and Directions. (2008)
 5. Rosati, R.: On conjunctive query answering in EL. In Calvanese, D., Franconi, E.,
    Haarslev, V., Lembo, D., Motik, B., Turhan, A.Y., Tessaris, S., eds.: Description
    Logics. Volume 250 of CEUR Workshop Proceedings., CEUR-WS.org (2007)
 6. Lutz, C., Toman, D., Wolter, F.: Conjunctive query answering in EL using a
    database system. In: Proceedings of the OWLED 2008 Workshop on OWL: Expe-
    riences and Directions. (2008)
 7. Kontchakov, R., Lutz, C., Toman, D., Wolter, F., Zakharyaschev, M.: The com-
    bined approach to ontology-based data access. In: IJCAI, AAAI Press (2011)
 8. Abiteboul, S., Hull, R., Vianu, V.: Foundations of databases. Addison-Wesley
    (1995)
 9. Chen, W., Warren, D.S.: Query evaluation under the well-founded semantics.
    In: Proceedings of the twelfth ACM SIGACT-SIGMOD-SIGART symposium on
    Principles of database systems. PODS ’93, New York, NY, USA, ACM (1993)
    168–179
10. Stefanoni, G., Motik, B., Horrocks, I.:             Small datalog query rewrit-
    ing for EL.        Technical report, University of Oxford (2012) Available at
    http://www.cs.ox.ac.uk/people/giorgio.stefanoni/pubs/2012/tr/dl2012.pdf.
11. Ullman, J.D.: Principles of database and knowledge-base systems, Vol. I. Computer
    Science Press, Inc., New York, NY, USA (1988)
12. Ceri, S., Gottlob, G., Tanca, L.: What you always wanted to know about datalog
    (and never dared to ask). IEEE Trans. Knowl. Data Eng. 1(1) (1989) 146–166
13. Lutz, C., Toman, D., Wolter, F.: Conjunctive query answering in the description
    logic EL using a relational database system. In Boutilier, C., ed.: IJCAI. (2009)
    2070–2075
14. Lloyd, J., Topor, R.: Making prolog more expressive. The Journal of Logic Pro-
    gramming 1(3) (1984) 225 – 240
15. Decker, S.: Semantic web methods for knowledge managmement. PhD thesis,
    University of Karlsruhe (2002)