Conjunctive Query Answering in EL using
                a Database System

              Carsten Lutz1 , David Toman2 , and Frank Wolter3
                            1
                              Fachbereich Informatik
                          Universität Bremen, Germany
                         clu@informatik.uni-bremen.de
                   2
                     D.R. Cheriton School of Computer Science
                         University of Waterloo, Canada
                              david@uwaterloo.ca
                       3
                         Department of Computer Science
                           University of Liverpool, UK
                             frank@csc.liv.ac.uk


      Abstract. We study conjunctive query answering in the description
      logic EL, the core of the designated OWL2-EL profile of OWL2. In
      particular, we present an approach that allows the use of conventional
      relational database technology for conjunctive query answering in EL.
      This approach is inspired by the OWL support in Oracle Database 11g,
      semantic technologies.


1   Introduction
The OWL ontology language is currently being revisited by a W3C working
group with the goal of designing a revised and extended version of OWL, called
OWL2 [1]. The new recommendation aims to include a number of profiles, frag-
ments of OWL2 that trade expressive power for various other desirable properties
not enjoyed by the full language. At the time of writing, there were three pro-
files designated to become part of the OWL2 recommendation: OWL2-EL based
on the description logic EL++ [4, 5], OWL2-QL based on the description logic
DL-LiteA [7], and OWL2-RL inspired by the pD∗ semantics for OWL [17]. Each
of these fragments aims at achieving a different benefit while simultaneously
maximizing expressive power. In the case of OWL2-EL, the design goal is clas-
sification of ontologies in polynomial time. This profile has already been widely
adopted by large-scale ontologies in which efficiency of reasoning plays a central
role. In particular, many life-science ontologies such as SNOMED CT and NCI
are formulated in OWL2-EL [16, 15].
     In contrast to OWL2-EL, the design goal of both OWL2-QL and OWL2-RL
is to enable the use of database technology for query answering over OWL in-
stance data: many applications access considerable amounts of such data, often
in the range of millions of objects. This makes it necessary to store the data
in secondary storage and to develop highly efficient querying mechanisms. Re-
lational database systems often satisfy these needs and benefit from decades of
research in academia and industry. Thus they are natural candidates for im-
plementing a query processing solution for OWL. However, employing such an
approach is rather challenging due to significant differences between relational
database systems and OWL: first, relational systems adopt a closed-world se-
mantics, i.e., all facts that are not explicitly stated to be true are assumed to be
false. In contrast, OWL is based on an open world semantics which does not re-
quire one to fix the truth value of every fact and is more similar to an incomplete
database [10]. Second, database systems are unaware of the intensional part of
an OWL ontology (called henceforth the TBox).
    In OWL2-QL and OWL2-RL these difficulties are addressed in different ways.
In a nutshell, the OWL2-QL approach suggests employing a backward chaining
approach while OWL2-RL suggests forward chaining. In both cases the instance
data (called the ABox) is stored as a relational database instance. In OWL2-QL,
a query q is rewritten into a new query q ∗ that takes into account the TBox.
This rewriting is independent of the data. The resulting query q ∗ is passed
to the database system to retrieve answers. In this way, OWL2-QL allows the
use of an off-the-shelf relational database system with the help of an external
query rewriting component. In contrast, query answering in OWL2-RL relies on
a data preprocessing phase. This phase is initiated after each modification of
the instance data and adds data that is entailed by the explicitly given data
and the TBox. Data preprocessing is independent of the queries to be answered.
To actually answer a query, it then suffices to pass it to the relational query
engine without rewriting. The OWL2-RL approach thus suggests that the data
management facilities of the database system should be modified to incorporate
a data preprocessing phase, but the query engine is left untouched. Notably,
OWL2-RL and the described approach to query answering is used in the Oracle
Database 11g semantic technologies [18].
    In this paper, we study conjunctive query answering in the description logic
EL using relational database systems. Since EL can be viewed as the core of the
OWL2-EL profile and is successfully used in the design of large-scale ontologies,
adding efficient querying capabilities is of immediate practical use, e.g., in ap-
plications of OWL in medical informatics in which an ABox is used to describe
medical records and where a medical EL ontology, such as SNOMED CT, assigns
a meaning to the medical terms used in these records [12, 13]. Unsurprisingly,
very large ABoxes are frequently encountered in such applications and efficient
querying is a crucial requirement. Conjunctive query answering in EL has also
been studied in [9, 14, 8], but not in the context of database systems.
    The pure backward chaining approach of OWL2-QL cannot be applied to
EL since that approach is limited to DLs for which the data complexity of con-
junctive query answering is in LogSpace, while it has been shown that EL is
PTime-hard in this respect [6]. Pure forward chaining is not applicable either
because it may cause the introduction of infinitely many new database objects
(due to the use of existential restrictions on the right-hand side of concept inclu-
sions in a TBox). For these reasons, we present a novel approach to conjunctive
query answering that has some similarity to both of the approaches above. It
can be summarized as follows similarly to Oracle 11g semantic technologies, we
use a data preprocessing phase that adds new data to the database. This phase
depends on the ABox and the TBox, but not on the queries to be answered.
The data added during our preprocessing phase is auxiliary: it is only used in-
ternally by the ontology-aware database system for query answering and hidden
from the user. To answer a query, we first rewrite it and then pass it on to the
database query engine. Hence, our approach shares the main virtues of the two
approaches for OWL2-QL and OWL2-RL above; in particular it leaves the query
engine of the relational database system unchanged. Technically, our approach
relies on the fact that, in EL, conjunctive queries can be answered by considering
a single canonical model. During the preprocessing phase, we complete the data
in the database to such a model. The main challenge of pursuing this approach
is to work with a finite canonical model while still giving correct answers to all
conjunctive queries.
    It is worth mentioning that the query rewriting in our approach is of a rather
different nature than the rewriting done for OWL2-QL: it does not correspond
to backward chaining and is independent of the data and of the TBox. An-
other significant difference from OWL2-QL is that data preprocessing needs
only quadratic time and query rewriting needs only linear time. In contrast, the
query rewriting step of OWL2-QL may induce an exponential blowup.
    The remainder of this paper is organized as follows. In Section 2, we intro-
duce the description logic EL and conjunctive query answering. Section 3 then
provides an overview of our approach, introduces canonical models, and provides
a first example. The data preprocessing phase is described in Section 4 and query
rewriting is discussed in Section 5. In Section 6, we conclude with pointing out
future research issues.


2   Preliminaries

We briefly introduce the description logic EL and conjunctive queries. In EL,
concepts are built according to the syntax rule

                           C ::= A | > | C u D | ∃r.C

where, here and in the remaining paper, A ranges over concept names taken from
a countably infinite set NC , r ranges over role names taken from a countably
infinite set NR , and C, D range over concepts. A TBox is a finite set of concept
inclusions C v D, and an ABox is a finite set of concept assertions A(a) and role
assertions r(a, b), where a, b range over a countably infinite set NI of individual
names. A knowledge base is a pair (T , A) with T a TBox and A an ABox.
   As usual, an interpretation is a pair (∆I , ·I ) with ∆I a non-empty domain
and ·I an interpretation function that maps each concept name A to a subset
AI ⊆ ∆I , each role name r to a binary relation rI ⊆ ∆I × ∆I , and each
individual name a to an element aI ∈ ∆I The interpretation function is extended
to composite concepts by setting

                     >I = ∆ I
               (C u D)I = C I ∩ DI
                (∃r.C)I = {d ∈ ∆I | ∃e ∈ ∆I : (d, e) ∈ rI ∧ e ∈ C I }.

An interpretation I satisfies a concept inclusion C v D if C I ⊆ DI , a concept
assertion A(a) if aI ∈ AI , and a role assertion r(a, b) if (aI , bI ) ∈ rI . I is a
model of a TBox T (ABox A) if it satisfies all concept inclusions in T (assertions
in A). It is a model of a knowledge base K = (T , A) if it is a model of T and
A. For a concept inclusion or assertion α, we write K |= α if α is satisfied in all
models of K. If empty, A is simply omitted.
   Let NV be a countably infinite set of variables. Together, the set of NV of vari-
ables and the set NI of individual names form the set NT of terms. A conjunctive
query is an expression of the form ∃u.ϕ(v, u), where

 – u = u1 , . . . , un and v = v1 , . . . , vm are vectors of variables and
 – ϕ is a conjunction of concept atoms A(t) and role atoms r(t, t0 ), where A
   ranges over concept names, r over role names, and t, t0 are terms from the
   set v ∪ u ∪ NI .

The variables in u are called quantified variables, and the variables in v are
answer variables. We call the query k-ary if there are k answer variables, use
var(q) to denote the set of all variables in q, qvar(q) for the set of quantified
variables, avar(q) for the set of answer variables, and term(q) for the terms in q.
Slightly abusing notation, we write α ∈ q if the concept or role atom α occurs
in q.
    Let I be an interpretation and q = ∃u.ϕ(v, u) a conjunctive query. A match
for I and q is a mapping π : term(q) → ∆I such that π(a) = aI for all a ∈
term(q) ∩ NI and all atoms in q are satisfied, i.e.,

 – π(t) ∈ AI for all A(t) ∈ q and
 – (π(t), π(t0 )) ∈ rI for all r(t, t0 ) ∈ q.

If v = v1 , . . . , vk with π(vi ) = aIi for 1 ≤ i ≤ k, then π is called an (a1 , . . . , ak )-
match for I and q. If such a match exists, we write I |= q[a1 , . . . , ak ]. A certain
answer for a k-ary conjunctive query q and a knowledge base K is a tuple
(a1 , . . . , ak ) such that a1 , . . . , ak occur in K and, for each model I of K, there
is an (a1 , . . . , ak )-match for I and q. We use cert(q, K) to denote the set of
all certain answers for q and K. This defines the querying problem studied in
this paper: given an EL-knowledge base K and a conjunctive query q, we want
to compute cert(q, K). Observe that the definition of certain answers reflects
the open world semantics of ABoxes: we quantify over all possible models of K
instead of treating the ABox itself as a model.
   Without loss of generality, we admit only concept names in a conjunctive
query, instead of composite concepts. Indeed, it is easily possible to unfold an
EL-concept into the query. For example, the query ∃u.r(u, v) ∧ (A u ∃s.B)(u)
can be unfolded into ∃u, u0 .r(u, v) ∧ A(u) ∧ s(u, u0 ) ∧ B(u0 ).
   In the remainder of this paper, we make the unique name assumption, i.e.,
we assume that aI 6= bI for all interpretations I and all a, b ∈ NI with a 6= b.
The following (easy to prove) lemma shows that assuming UNA does not impact
the certain answers.

Lemma 1. Let K be a knowledge base, q a k-ary conjunctive query, and let
a1 , . . . , ak ∈ NI . If there is a model I of K and an (a1 , . . . , ak )-match for I
and q, then there is a model I 0 of K that respects the UNA and an (a1 , . . . , ak )-
match for I 0 and q.


3    Canonical Models

Every EL knowledge base K = (T , A) has a canonical model IK that enjoys
many pleasant properties [2, 11]. To define IK , let sub(T ) denote the set of all
subconcepts of concepts used in T , Ind(A) the set of individual names that occur
in A, and set NIaux := {xC | C ∈ sub(T )}. Then

           ∆IK := Ind(A) ∪ NIaux
           AIK := {a ∈ ind(A) | K |= A(a)} ∪ {xC ∈ NIaux | K |= C v A}
           rIK := {(a, b) ∈ ind(A) × ind(A) | r(a, b) ∈ A}∪
                  {(a, xC ) ∈ ind(A) × NIaux | K |= ∃r.C(a)}∪
                  {(xC , xD ) ∈ NIaux × NIaux | K |= C v ∃r.D}
           aIK := a for all a ∈ ind(K)

Among the pleasant properties of IK is that it can be used to answer instance
queries. Such a query has the form C(a) and is entailed by a knowledge base K
if aI ∈ C I for all models I of K. It is well-known that C(a) is entailed by K
iff a ∈ C IK , and thus we can answer instance queries w.r.t. K by looking only
at the single model IK [4]. Since A can be viewed as being a fragment of IK ,
this observation suggests that we can use a database system to answer instance
queries as follows: we store A as a database, complete A to IK in a preprocessing
phase, and then just pass instance queries to the database system.
    In principle, it is the same approach that we want to use for conjunctive
queries. Unfortunately, for conjunctive queries q it is not true that (a1 , . . . , ak ) ∈
cert(q, K) iff IK |= q[a1 , . . . , qk ] for all a1 , . . . , ak ∈ NI with k the arity of q. To see
this, let us consider four examples. First, take the knowledge base K1 = (T1 , A1 )
with
                          T1 = {A v ∃r.B}
                         A1 = {A(a), A(b), r(a0 , c0 ), r(b0 , c0 )}
                           q1 = ∃u.r(v, u) ∧ r(v 0 , u)
We have (a, xB ) ∈ rIK1 and (b, xB ) ∈ rIK1 , and thus IK1 |= q[a, b]. However,
(a, b) ∈
       / cert(q1 , K1 ) because it is easy to find a model I of K1 with I 6|= q1 [a, b].
Second, take
                               T2 = {A v ∃r.B u ∃s.B}
                               A2 = {A(a)}
                               q2 = ∃u.r(v, u) ∧ s(v, u)

Similarly to our first example, it is not hard to see that IK2 |= q2 [a], but a ∈
                                                                                /
cert(q2 , K2 ). Our third example is K3 = (T3 , A3 ) with

                             T3 = {A v ∃r.B, B v ∃s.B}
                             A3 = {A(a)}
                             q3 = ∃u.r(v, u) ∧ s(u, u)

Again, IK3 |= q3 [a], but a ∈/ cert(q3 , K3 ). All the three examples above show
situations in which matches can be found in the canonical model, matches that
do not exist in the unraveling of this model. Our last example is K4 = (T4 , A4 )
with
                              T4 = {A v A}
                              A4 = {B(a)}
                              q4 = ∃u.B(v) ∧ A(u)

illustrates a different difficulty: while xA ∈ AIK4 , in the unraveled model the
interpretation of A is empty: hence, IK4 |= q4 [a], but a ∈   / cert(q4 , K4 ). This
situation can be detected by observing that the value x4 is not reachable from
a by a role chain in IK4 .
    In principle, all of these problems can be overcome by replacing IK with its
unraveling into a less constrained, tree-like model. In the following, we introduce
unraveling as a general operation on models of a knowledge base. Let K be a
knowledge base and I a model of K. We use Ind(A)I to denote the set {aI |
a ∈ Ind(A)}. A path in I is a finite sequence d0 r1 d1 · · · rn dn , n ≥ 0, where
d0 ∈ Ind(A)I and, for all i < n, (di , di+1 ) ∈ ri+1
                                                 I
                                                     . We use paths(I) to denote the
set of all paths in I. If p ∈ paths(I), then tail(p) denotes the last element dn
in p. Now the unraveling J of I is defined as follows:

   ∆J := paths(I)
   aJ := aI
   AJ := {p | tail(p) ∈ AI }
   rJ := {(d, e) | d, e ∈ Ind(A)I ∧ (d, e) ∈ rI } ∪ {(p, p · re) | p, p · re ∈ ∆I }

where “·” denotes concatenation. The following result is proved in [8] for the
extension of EL with inverse and functional roles and for the case of 0-ary queries.
An extension to k-ary queries is straightforward.

Lemma 2. Let K be a knowledge base and UK the unraveling of IK . For all k-
ary conjunctive queries q and individual names a1 , . . . , ak , we have (a1 , . . . , ak ) ∈
cert(q, K) iff UK |= q[a1 , . . . , ak ].
Alas, UK may be infinite and thus we cannot use it as a database. The solution
is to continue working with IK , but to rewrite q into q ∗ such that each match
of q ∗ in IK can be reproduced as a match of q in UK and vice versa, thus
IK |= q ∗ [a1 , . . . , ak ] iff UK |= q[a1 , . . . , ak ] for all a1 , . . . , ak ∈ NI . We defer a
formal definition and concrete examples of this rewriting to Section 5.


4     Data Preprocessing
We describe the data preprocessing phase of our approach to conjunctive query
answering in EL, where we insert additional, auxiliary data. The purpose of this
phase is to complete the ABox A initially stored in the database to (a fragment
of) the canonical model IK . In principle, the auxiliary data has to be updated
each time that A and T are modified (for the purposes of query answering,
however, it makes sense to assume that T is essentially immutable). To keep
the presentation simple, we concentrate on the case where the auxiliary data
is computed from scratch. In an actual implementation, it would be preferable
to modify the auxiliary data in an incremental way when data is inserted to or
deleted from A.
    Let K = (T , A) be the knowledge base over which queries are to be an-
swered. We store the ABox A as data in a database system in the obvious way:
individual names become database objects, concept names become unary tables,
and role names become binary tables. When we understand A as a database,
we denote it with Adb and use ans(q, Adb ) to denote the set of answers that a
relational database system returns for q over the database Adb . As explained in
the previous section, cert(q, K) and ans(q, Adb ) are not identical in general. The
precompletion phase extends Adb to a database A∗db that represents (a fragment
of) the canonical model IK . In the following, we present details of how this
completion can be achieved using the query mechanism of a relational system.
    In A∗db , we use the elements of NIaux as additional database objects, assuming
that NIaux ∩ NI = ∅. We also introduce an additional unary database table Aux
that is used to identify elements of NIaux and distinguish them from individual
names in A.4 First, we use a reasoner such as CEL [3] to compute the canonical
model of the TBox T defined as follows:
                 ∆IT := NIaux
                 AIT := {xC ∈ NIaux | T |= C v A}
                 rIT := {(xC , xD ) ∈ NIaux × NIaux | T |= C v ∃r.D}

This model should not to be confused with the canonical model IK of the knowl-
edge base K; in fact, IT is obviously a fragment of IK . It can be computed in
polynomial time and its size is at most quadratic in the size of T . To properly
insert IT into the database, we need a notion of reachability between domain
elements of IT . For xC , xD ∈ ∆IT , we say that xD is reachable from xC if there
are xC0 , . . . , xCn ∈ ∆IT , n ≥ 0, such that xC0 = xC , xCn = xD , and for all i < n,
4
    Thus, Aux denotes the complement of Ind(A)IK , as introduced in Section 3.
for all C, A ∈ sub(T ) such that for some C v D ∈ T , A is a conjunct of D do
                                       (i)
   extend the table A with ans(qC , Adb )
for all C, ∃r.D ∈ sub(T ) such that for some C v E ∈ T , ∃r.D is a conjunct of E do
                (i)
   if ans(qC , Adb ) 6= ∅ then
                                                        (i)
      extend the table r with {(a, xD ) | a ∈ ans(qC , Adb )}
      extend the table Aux with Reach(xD )
      extend the table A with AIT ∩ Reach(xD ), for each concept name A
      extend the table s with sIT ∩ (Reach(xD ) × Reach(xD )), for each role name s
   endif

               Fig. 1. The central procedure for data preprocessing.

there is an r ∈ NR with (xCi , xCi+1 ) ∈ rIT . We use Reach(xC ) to denote the set
of those xD ∈ ∆IT that are reachable from xC (including xC itself). Now, the
                                                               (1)      (k)
data preprocessing phase constructs a sequence of ABoxes Adb , . . . Adb starting
        (0)
with Adb = Adb and then repeatedly applying the procedure from Figure 1 to
            (i+1)       (i)
produce Adb       from Adb . In the figure, qC denotes the result of converting the
concept C into a tree-shaped conjunctive query with one answer variable that
denotes the root (c.f. the remark at the end of Section 2) and “C is a conjunct
of D” includes the case where C = D. The repeated application of this procedure
stops when no more changes to the data occur.
    It is not hard to see that A∗db represents IK restricted to those domain el-
ements that are reachable (along roles) from some individual name in A. This
restriction to reachable elements is not (only) an optimization: it is needed in
order to eliminate the problem illustrated in the fourth example (K4 , q4 ) in pre-
vious section. We now analyze the complexity of the preprocessing phase.
Lemma 3. Let n be the number of individual names in A and m the number of
subconcepts in T . Then the procedure in Figure 1 is applied at most n · m times.
We note that n and m are linear in the size of A and T , and that the bound given
in Lemma 3 is only an upper bound that is unlikely to occur in practice. Let
us use A∗db to denote the final database that is constructed in the preprocessing
phase. It is not hard to see that the size of A∗db is bounded by O(n · m), where
n and m are as in Lemma 3.


5   Query Rewriting

This section shows how to rewrite a given conjunctive query q into a (domain
independent) first-order query q ∗ such that cert(q, K) = ans(q ∗ , A∗db ) for every
knowledge base K = (T , A), i.e., the answers that a relational system returns for
q ∗ are precisely the certain answers for q and K. We assume that A∗db (defined
in the previous section) is stored in a relational database system. We remind
the reader that domain independent first-order queries are nothing else but SQL
queries. In particular, every domain independent FO query can be rewritten into
an SQL query in linear time.
    To proceed, we use several auxiliary definitions. Let ∼q denote the smallest
relation on term(q) that includes the identity relation, is transitive, and satisfies
the following closure condition:

(∗) if there are r(s, t), r(s0 , t0 ) ∈ q with t ∼q t0 , then s ∼q s0 .

The relation ∼q is central to our rewriting procedure. Recall from Section 3 that
the goal of query rewriting is to produce a query q ∗ such that matches of q ∗
in IK can be reproduced as matches of q in the unraveling UK of IK and vice
versa. Intuitively, UK is produced from IK by keeping the Ind(A)-part of IK
intact and relaxing the NIaux -part into a collection of trees. The importance of
(∗) can be seen when assuming t = t0 : then (∗) describes a non-tree situation
in the query since t = t0 has two predecessors s and s0 . Therefore, we should
avoid matches π of q to the NIaux -part of IK where π(s) 6= π(s0 ) as we will not
be able to reproduce them in UK . The case where t ∼q t0 instead of t = t0 can
be understood similarly.
    It is not hard to verify that ∼q is an equivalence relation and can be computed
in time polynomial in the size of q. For each equivalence class ζ of ∼q , choose a
representative tζ ∈ ζ. Let

 – Fork= be the set of pairs (T, ζ) with T ⊆ term(q) and ζ an equivalence class
   of ∼q such that for some r ∈ NR , T = {t ∈ term(q) | ∃t0 ∈ ζ : r(t, t0 ) ∈ q},
   and T is of cardinality at least two;

 – Fork6= ⊆ qvar(q) be the set of quantified variables v such that for some
   s, s0 , t ∈ term(q) and r, r0 ∈ NR with r 6= r0 , we have r(s, v), r0 (s0 , t) ∈ q and
   v ∼q t;

 – Cyc ⊆ qvar(q) be the set of quantified variables v such that there are

                             r0 (t0 , t00 ), . . . , rn−1 (tn , t0n ) ∈ q, n ≥ 0,

    with v = ti for some i ≤ n and t0i ∼q ti+1 mod n for all i < n.

It is not hard to see that Fork= , Fork6= , and Cyc can be computed in time poly-
nomial in the size of q. The rewritten query q ∗ is defined as q ∧ q1 ∧ q2 , where q1
and q2 are as follows:
                       ^
          q1 :=                   ¬Aux(v)
                  v∈avar(q)∪Fork6= ∪Cyc
                           ^
          q2 :=                             ( Aux(tζ ) → (t1 = t2 ∧ · · · ∧ tk−1 = tk ) )
                  ({t1 ,...,tk },ζ)∈Fork=

Note that the construction of q ∗ does not depend on K and can be carried out in
polynomial time. Moreover, q ∗ is at most linear in the size of q. To see this, first
note that the number of conjuncts in q1 is bounded by the number of variables
in q. Regarding q2 , let Fork= = {(T1 , ζ1 ), . . . , (T` , ζ` )}. It is not hard to see that
                        v0                                v1                                v2
                             r                                 r                                 r
                                                                                           
                        v3 A                              v4 A                              v5
                                 AA r                 }            AA r                 }
                                   AA               }}               AA               }}
                                                  }}                                }}
                                                r                                 r
                                     AA          }                     AA          }
                                               }~                                }~
                                          v6                                v7


                                    Fig. 2. Example Query.


|T1 |+· · ·+|T` | is bounded by the number of role atoms in q, and thus the number
of conjuncts in q2 is also bounded linearly in the size of q. Since q is a conjunct
of q ∗ , it is readily checked that q ∗ is domain independent.
     The following theorem states the correctness of our approach to conjunctive
query answering in EL using a database system. Its proof is given in the full
version of this paper available at http://lat.inf.tu-dresden.de/∼clu.
Theorem 1. cert(q, K) = ans(q ∗ , A∗db ).
We now give three examples for query rewriting:
 – For tree-shaped queries and for queries without quantified  variables almost
   no query rewriting is needed: in both cases q ∗ = q ∧ v∈avar(q) ¬Aux(v).
                                                        V


 – The queries q1 to q3 from Section 3 are rewritten as follows. In the case of q1 =
   ∃u.r(v, u) ∧ r(v 0 , u), ∼q consists of the equivalence classes {v, v 0 } and {u}.
   Assume that the chosen representative for {v, v 0 } is v. In this example, q2
   and Fork= are the most important ingredients of the rewriting, and we have
       q1∗ = ∃u.¬Aux(v) ∧ ¬Aux(v 0 ) ∧ r(v, u) ∧ r(v 0 , u) ∧ (Aux(u) → v = v 0 ).
    In the case of q2 = ∃u.r(v, u) ∧ s(v, u), the Fork6= part of q1 is the most
    important ingredient of the rewriting and we have
                   q2∗ = ∃u.¬Aux(v) ∧ ¬Aux(u) ∧ r(v, u) ∧ s(v, u).
    Finally, reconsider q3 = ∃u.r(v, u) ∧ s(u, u). Here, the Cyc part of the rewrit-
    ing plays the crucial role, and we have
                    q3∗ = ∃u.¬Aux(v) ∧ ¬Aux(u) ∧ r(v, u) ∧ s(u, u)
 – Let q be the query shown in Figure 2, where all variables are quantified.
   Then ∼q consists of the equivalence classes {v0 , v1 , v2 }, {v3 , v4 , v5 }, {v6 },
   and {v7 }. Assume that the chosen representative for {v3 , v4 , v5 } is v3 . Thus,
   we have
                     q∗ = q ∧
                          Aux(v6 ) → (v3 = v4 ) ∧
                          Aux(v7 ) → (v4 = v5 ) ∧
                          Aux(v3 ) → ((v0 = v1 ) ∧ (v1 = v2 ))
 – Let qnc be the query that has no answer variables and whose body is an
   n-clique, i.e.,                                 ^
                        qnc = ∃v0 , . . . , vn−1 .   r(vi , vj )
                                                i,j<n

    Then ∼q consists of a single equivalence class {v0 , . . . , vn−1 }. Assume that
    the representative is v0 . Then we have

                (qnc )∗ = qnc ∧ ¬Aux(v0 ) ∧ · · · ∧ ¬Aux(vn−1 ) ∧
                          Aux(v0 ) → ((v0 = v1 ) ∧ · · · ∧ (vn−2 = vn−1 ))

    which can be simplified to the equivalent qnc ∧ ¬Aux(v0 ) ∧ · · · ∧ ¬Aux(vn−1 ).
As illustrated by the last example, we can drop a conjunct from q2 whenever the
variable occurring in its precondition occurs in a conjunct of q1 .


6   Conclusion
We find it very satisfactory that for the widely-used EL fragment of OWL2
there is a rather direct way to use relational database systems for sound and
complete query answering. The work presented in this paper, however, is only
a first step and leaves a lot of room for further investigation. In particular, it
would be interesting to carry out experiments to evaluate the feasibility of our
approach in practice. To reduce the (polynomial) blowup produced by the data
preprocessing phase, one could study a more careful form of preprocessing that
materializes only certain concept and role names in the database and takes care of
the remaining ones using query rewriting. Another obvious direction is to extend
the approach to more expressive variants of EL. For example, we conjecture that
adding the bottom concept and role hierarchies is simple. In contrast, transitive
roles and left/right identity statements might pose more difficulties. Finally, one
could also try to adopt the proposed technique for DL-Lite, thus establishing
an alternative approach to query answering for that family of DLs. Since DL-
Lite is lacking the finite model property, we cannot store a canonical model in a
database (as in the case of EL). However, it is possible that a finite representation
of such a model can be used. In that case, the query rewriting step can be
expected to depend both on the original query and on the TBox.


References
 1. http://www.w3.org/2007/OWL/wiki/OWL Working Group.
 2. F. Baader, S. Brandt, and C. Lutz. Pushing the EL envelope. LTCS-Report
    LTCS-05-01, Chair for Automata Theory, Institute for Theoretical Computer Sci-
    ence, Dresden University of Technology, Germany, 2005. See http://lat.inf.tu-
    dresden.de/research/reports.html.
 3. F. Baader, C. Lutz, and B. Suntisrivaraporn. Tractable reasoning in the EL family
    of description logics. Journal of Logic, Language, and Information, 2007. To
    appear.
 4. F. Baader, S. Brandt, and C. Lutz. Pushing the EL envelope. In Proceedings of the
    Nineteenth International Joint Conference on Artificial Intelligence (IJCAI’05),
    pages 364–369. Professional Book Center, 2005.
 5. F. Baader, S. Brandt, and C. Lutz. Pushing the el envelope further. In In Pro-
    ceedings of OWLED2008, 2008.
 6. D. Calvanese, G. de Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Data
    complexity of query answering in description logics. In Proceedings of KR2006,
    pages 260–270. AAAI Press, 2006.
 7. D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, A. Poggi, and R. Rosati.
    Linking data to ontologies: The description logic DL-LiteA . In Proceedings of
    OWLED2006, volume 216 of CEUR-WS, 2006.
 8. A. Krisnadhi and C. Lutz. Data complexity in the EL family of DLs. In Proceedings
    of DL2007, volume 250 of CEUR-WS, 2007.
 9. M. Krötzsch, S. Rudolph, and P. Hitzler. Conjunctive queries for a tractable
    fragment of OWL 1.1. In Proceedings of ISWC2007, volume 4825 of LNCS, pages
    310–323. Springer, 2007.
10. A.Y. Levy. Obtaining complete answers from incomplete databases. In In Pro-
    ceedings of VLDB1996, pages 402–412, 1996.
11. C. Lutz and F. Wolter. Conservative extensions in the lightweight description logic
    EL. In Proceedings of the 21th Conference on Automated Deduction (CADE-21),
    volume 4603 of LNAI, pages 84–99. Springer, 2007.
12. C. Patel, J.J. Cimino, J. Dolby, A. Fokoue, A. Kalyanpur, A. Kershenbaum, L. Ma,
    E. Schonberg, and K. Srinivas. Matching patient records to clinical trials using
    ontologies. In Proceedings of ISWC2007, volume 4825 of LNCS, pages 816–829.
    Springer, 2007.
13. J. Patrick and P. Budd. Automatic conversion of clinical notes into snomed ct at
    point of care. In Proceedings of HIC2006 and HINZ2006, pages 209–213, 2006.
14. R. Rosati. On conjunctive query answering in EL. In Proceedings of DL2007,
    volume 250 of CEUR-WS, 2007.
15. N. Sioutos, S. de Coronado, M.W. Haber, F.W. Hartel, W.L. Shaiu, and L.W.
    Wright. NCI thesaurus: a semantic model integrating cancer-related clinical and
    molecular information. Journal of Biomedical Informatics, 40(1):30–43, 2006.
16. K.A. Spackman. Managing clinical terminology hierarchies using algorithmic cal-
    culation of subsumption: Experience with SNOMED-RT. Journal of the American
    Medical Informatics Association, 2000.
17. H.J. ter Horst. Completeness, decidability and complexity of entailment for rdf
    schema and a semantic extension involving the owl vocabulary. Journal of Web
    Semantics, 3(2-3):79–115, 2005.
18. Z. Wu, G. Eadon, S. Das, E.I. Chong, V. Kolovski, M. Annamalai, and J. Srini-
    vasan. Implementing an inference engine for RDFS/OWL constructs and user-
    defined rules in oracle. In Proceedings of ICDE2008, pages 1239–1248. IEEE,
    2008.