Exact Learning Description Logic Ontologies from Data
                Retrieval Examples

                         Boris Konev, Ana Ozaki, and Frank Wolter

                 Department of Computer Science, University of Liverpool, UK


         Abstract. We investigate the complexity of learning description logic ontologies
         in Angluin et al.’s framework of exact learning via queries posed to an oracle. We
         consider membership queries of the form “is individual a a certain answer to a data
         retrieval query q in a given ABox and the unkown target TBox?” and equivalence
         queries of the form “is a given TBox equivalent to the unknown target TBox?”. We
         show that (i) DL-Lite TBoxes with role inclusions and ELI concept expressions
         on the right-hand side of inclusions and (ii) EL TBoxes without complex concept
         expressions on the right-hand side of inclusions can be learned in polynomial time.
         Both results are proved by a non-trivial reduction to learning from subsumption
         examples. We also show that arbitrary EL TBoxes cannot be learned in polynomial
         time.


1     Introduction
Building an ontology is prone to errors, time consuming, and costly. The research com-
munities has addressed this problem in many different ways, for example, by supplying
tool support for editing ontologies [15, 4, 9], developing reasoning support for debug-
ging ontologies [18], supporting modular ontology design [17], and by investigating
automated ontology generation from data or text [8, 6, 5, 14]. One major problem when
building an ontology is the fact that domain experts are rarely ontology engineering ex-
perts and that, conversely, ontology engineers are typically not familiar with the domain
of the ontology. An ontology building project therefore often relies on the successful
communication between an ontology engineer (familiar with the semantics of ontology
languages) and a domain expert (familiar with the domain of interest). In this paper, we
consider a simple model of this communication process and analyse, within this model,
the computational complexity of reaching a correct domain ontology. We assume that
    – the domain expert knows the domain ontology and its vocabulary without being able
      to formalize or communicate this ontology;
    – the domain expert is able to communicate the vocabulary of the ontology and shares
      it with the the ontology engineer. Thus, the domain expert and ontology engineer
      have a common understanding of the vocabulary of the ontology. The ontology
      engineer knows nothing else about the domain.
    – the ontology engineer can pose queries to the domain expert which the domain
      expert answers truthfully. Assuming that the domain expert can interpret data in
      her area of expertise, the main queries posed by the ontology engineer are based on
      instance retrieval examples:
      • assume a data instance A and a query q(x) are given. Is the individual a a
         certain answer to query q(x) in A and the ontology O?
    In addition, we require a way for the ontology engineer to find out whether she
    has reconstructed the target ontology already and, if this is not the case, to request
    an example illustrating the incompleteness of the reconstruction. We abstract from
    defining a communication protocol for this, but assume for simplicity that the
    following query can be posed by the ontology engineer:
      • Is this ontology H complete? If not, return a data instance A, a query q(x), and
         an individual a such that a is a certain answer to q(x) in A and the ontology O
         and it is not a certain answer to q(x) in A and the ontology H.

Given this model, our question is whether the ontology engineer can learn the target
ontology O and which computational resources are required for this depending on the
ontology language in which the ontology O and the hypothesis ontologies H are formu-
lated. Our model obviously abstracts from a number of fundamental problems in building
ontologies and communicating about them. In particular, it makes the assumption that
the domain expert knows the domain ontology and its vocabulary (without being able
to formalize it) despite the fact that finding an appropriate vocabulary for a domain of
interest is a major problem in ontology design [8]. We make this assumption here in
order to isolate the problem of communication about the logical relationships between
known vocabulary items and its dependence on the ontology language within which the
relationships can be formulated.
    The model described above is an instance of Angluin et al.’s framework of exact
learning via queries to an oracle [1]. The queries using instance retrieval examples can
be regarded as membership queries posed by a learner to an oracle and the completeness
query based on a hypothesis H can be regarded as an equivalence query by the learner to
the oracle. Formulated in Angluin’s terms we are thus interested in whether there exists
a deterministic learning algorithm that poses membership and equivalence queries of
the above form to an oracle and that learns an arbitrary ontology over a given ontology
language in polynomial time. We consider polynomial learnability in three distinct DLs:
we show that DL-Lite ontologies with role inclusions and arbitrary ELI concepts on
the right-hand side of concept inclusions can be learned in polynomial time if database
queries in instance retrieval examples are ELI instance queries (or, equivalently, acyclic
conjunctive queries). We call this DL DL-Lite∃R and note that it is the core of the web
ontology language profile OWL2 QL. We also note that without complex ELI concepts
on the right-hand side of concept inclusions, polynomial learnability would be trivial as
only finitely many non-equivalent such TBoxes exist over a given vocabulary of concept
and role names. The second DL we consider is EL which is the logic underpinning the
web ontology language profile OWL2 EL. We show that EL TBoxes cannot be learned
in polynomial time using the protocol above if the database queries in instance retrieval
examples are EL instance queries. We then consider the fragment ELlhs of EL without
complex concepts on the right-hand side of concept inclusions and prove that it can be
learned in polynomial time using the above protocol with instance retrieval examples.
The proofs of the positive learning results are by reduction to polynomial time learnability
results for DL-Lite∃R and ELlhs for the case in which concept subsumptions rather than
instance retrieval examples are used in the communication between the learner and the
oracle [12]. Our move from concept subsumptions to data retrieval examples is motivated
by the observation that domain experts are often more familiar with querying data in their
domain than with the logical notion of subsumption between complex concepts. Detailed
proofs are provided at http://csc.liv.ac.uk/˜frank/publ/publ.html.


2      Preliminaries
Let NC and NR be countably infinite sets of concept and role names, respectively. The
dialect DL-Lite∃R of DL-Lite is defined as follows [7]. A role is a role name or an inverse
role r− with r ∈ NR . A role inclusion (RI) is of the form r v s, where r and s are
roles. A basic concept is either a concept name or of the form ∃r.>, with r a role. A
DL-Lite∃R concept inclusion (CI) is of the form B v C, where B is a basic concept
expression and C is an ELI concept expression, that is, C is formed according to the
rule C, D := A | > | C u D | ∃r.C | ∃r− .C where A ranges over NC and r ranges
over NR . A DL-Lite∃R TBox is a finite set of DL-Lite∃R CIs and RIs. As usual, an EL
concept expression is an ELI concept expression that does not use inverse roles, an EL
concept inclusion has the form C v D with C and D EL concept expressions, and a
(general) EL TBox is a finite set of EL concept inclusions [2]. We also consider the
restriction ELlhs of general EL TBoxes where only concept names are allowed on the
right-hand side of concept inclusions. The size of a concept expression C, denoted with
|C|, is the length of the string that represents it, where concept names and role names are
considered to be of length one. A TBox signature is the set of concept P and role names
occurring in the TBox. The size of a TBox T , denoted with |T |, is CvD∈T |C| + |D|.
    Let NI be a countably infinite set of individual names. An ABox A is a finite non-
empty set containing concept name assertions A(a) and role assertions r(a, b), where
a, b are individuals in NI , A is a concept name and r is a role. Ind(A) denotes the set of
individuals that occur in A. A is a singleton ABox if it contains only one ABox assertion.
Assertions of the form C(a) and r(a, b), where a, b ∈ NI , C an ELI concept expression,
and r ∈ NR , are called instance assertions. Note that instance assertions of the form
C(a) with C not a concept name nor C = > do not occur in ABoxes. The semantics
of description logic is defined as usual [3]. We write I |= α to say that an inclusion or
assertion α is true in I. An interpretation I is a model of a KB (T , A) if I |= α for all
α ∈ T ∪ A. (T , A) |= α means that I |= α for all models I of (T , A).
    A learning framework F is a triple (X, L, µ), where X is a set of examples (also
called domain or instance space), L is a set of learning concepts1 and µ is a mapping
from L to 2X . The subsumption learning framework FS , studied in [12], is defined as
(XS , L, µS ), where L is the set of all TBoxes that are formulated in a given DL; XS is
the set of subsumption examples of the form C v D, where C, D are concept expressions
of the DL under consideration; and µS (T ) is defined as {C v D ∈ XS | T |= C v D},
for every T ∈ L. It should be clear that µS (T ) = µS (T 0 ) if, and only if, the TBoxes T
and T 0 entail the same set of inclusions, that is, they are logically equivalent.
 1
     In the learning literature (e.g., [1]), the term ‘learning concept’ is often defined as a set of
     examples. We do not distinguish between learning concepts and their representations and only
     consider representable learning concepts to emphasize on the task of identifying a TBox that is
     logically equivalent to the target TBox.
    We study the data retrieval learning framework FD defined as (XD , L, µD ), where
L is same as in FS ; X is the set of data retrieval examples of the form (A, D(a)),
where A is an ABox, D(a) is a concept assertion of the DL under consideration, and
a ∈ Ind(A); and µ(T ) = {(A, D(a)) ∈ XD | (T , A) |= D(a)}. As in the case of
learning from subsumptions, µS (T ) = µS (T 0 ) if, and only if, the TBoxes T and T 0 are
logically equivalent.
    Given a learning framework F = (X, L, µ), we are interested in the exact identi-
fication of a target learning concept l ∈ L by posing queries to oracles. Let MEMl,X
be the oracle that takes as input some x ∈ X and returns ‘yes’ if x ∈ µ(l) and ‘no’
otherwise. We say that x is a positive example for l if x ∈ µ(l) and a negative example
for l if x 6∈ µ(l). Then a membership query is a call to the oracle MEMl,X . Similarly,
for every l ∈ L, we denote by EQl,X the oracle that takes as input a hypothesis learning
concept h ∈ L and returns ‘yes’, if µ(h) = µ(l), or a counterexample x ∈ µ(h) ⊕ µ(l)
otherwise, where ⊕ denotes the symmetric set difference. An equivalence query is a call
to the oracle EQl,X .
    We say that a learning framework (X, L, µ) is exact learnable if there is an algorithm
A such that for any target l ∈ L the algorithm A always halts and outputs l0 ∈ L such
that µ(l) = µ(l0 ) using membership and equivalence queries answered by the oracles
MEMl,X and EQl,X , respectively. A learning framework (X, L, µ) is polynomially
exact learnable if it is exact learnable by an algorithm A such that at every step2 of
computation the time used by A up to that step is bounded by a polynomial p(|l|, |x|),
where l is the target and x ∈ X is the largest counterexample seen so far3 . As argued in
the introduction, for learning subsumption and data retrieval learning frameworks we
additionally assume that the signature of the target TBox is always known to the learner.
    An important class of learning algorithms—in particular, all algorithms presented
in [12, 10, 16] fit in this class—always make equivalence queries with hypotheses h
which are polynomial in the size of l and such that µ(h) ⊆ µ(l), so that counterexamples
returned by the EQl,X oracles are always positive. We say that such algorithms use
positive bounded equivalence queries.


3      Polynomial Time Learnability
In this section we prove polynomial time exact learnability of the DL-Lite∃R and ELlhs
data retrieval learning frameworks. These frameworks are instances of the general
definition given above, where the concept expression D in a data retrieval example
(A, D(a)) is an ELI concept expression in the DL-Lite∃R framework and an EL concept
expression in the ELlhs framework, respectively.
    The proof is by reduction to learning from subsumptions. We illustrate its idea for
ELlhs . To learn a TBox from data retrieval examples we run a learning from subsumptions
algorithm as a ‘black box’. Every time the learning from subsumptions algorithm makes
a membership or an equivalence query we rewrite the query into the data setting and pass
it on to the data retrieval oracle. The oracle’s answer, rewritten back to the subsumption
 2
     We count each call to an oracle as one step of computation.
 3
     We assume some natural notion of a length of an example x and a learning concept l, denoted
     |x| and |l|, respectively.
                                A             A A            A       A            A A            A
                                r    ..       s r       ..   s       r   ..       s r       ..   s
                                      .                  .                .                  .
           r,s                            r         s                         r         s
                                               A                                   A
           A                                        r            A        s


       Fig. 1: An ABox A = {r(a, a), s(a, a), A(a)} and its unravelling up to level n.


setting, is given to the learning from subsumptions algorithm. When the learning from
subsumptions algorithm terminates we return the learnt TBox. This reduction is made
possible by the close relationship between data retrieval and subsumption examples. For
every TBox T and inclusions C v D, one can interpret a concept expression C as a
labelled tree and encode this tree as an ABox AC with root ρC such that T |= C v D
iff (T , AC ) |= D(ρC ).
     Then, membership queries in the subsumption setting can be answered with the
help of a data retrieval oracle due to the relation between subsumptions and instance
queries described above. An inclusion C v D is a (positive) subsumption example
for some target TBox T if, and only if, (AC , D(ρC )) is a (positive) data retrieval
example for the same target T . To handle equivalence queries, we need to be able to
rewrite data retrieval counterexamples returned by the data retrieval oracle into the
subsumption setting. For every TBox T and data retrieval query (A, D(a)) one can
construct a concept expression CA such that (T , A) |= D(a) iff T |= CA v D. Such
a concept expression CA can be obtained by unravelling A into a tree-shaped ABox
and representing it as a concept expression. This unravelling, however, can increase the
ABox size exponentially. Thus, to obtain a polynomial bound on the running time of the
learning process, CA v D cannot be simply returned as an answer to a subsumption
equivalence query. For example, for a target TBox T = {∃rn .A v B} and a hypothesis
H = ∅ the data retrieval query (A, B(a)), where A = {r(a, a), s(a, a), A(a)}, is
a positive counterexample. The tree-shaped unravelling of A up to level n is a full
binary tree of depth n, as shown in Fig. 1. On the other hand, the non-equivalence of
T and H can already be witnessed by (A0 , B(a)), where A0 = {r(a, a), A(a)}. The
unravelling of A0 up to level n produces a linear size ABox {r(a, a2 ), r(a2 , a3 ), . . . ,
r(an−1 , an ), A(a), A(a2 ), . . . , A(an )}, corresponding to the left-most path in Fig. 1,
which, in turn, is linear-size w.r.t. the target inclusion ∃rn .A v B. Notice that A0
is obtained from A by removing the s(a, a) edge and checking, using membership
queries, whether (T , A0 ) |= q still holds. In other words, one might need to ask further
membership queries in order to rewrite answers to data retrieval equivalence queries
given by the data retrieval oracle into the subsumption setting.
     We address the need of rewriting counterexamples by introducing an abstract notion
of reduction between different exact learning frameworks. To simplify notation, we
assume that both learning frameworks use the same set of learning concepts L and only
consider positive bounded equivalence queries. This definition of reduction can be easily
extended to arbitrary learning frameworks and arbitrary queries.
    We say that a learning framework F = (X, L, µ) polynomially reduces to F0 =
(X , L, µ0 ) if for some polynomials p1 (·), p2 (·) and p3 (·, ·) there exist a function f :
   0

X 0 → X and a partial function g : L × L × X → X 0 , defined for every (l, h, x) such
that |h| = p1 (|l|), µ(h) ⊆ µ(l) and x ∈ X, for which the following conditions hold.
  – For all x0 ∈ X 0 we have x0 ∈ µ0 (l) if, and only if, f (x0 ) ∈ µ(l);
  – For all x ∈ X we have x ∈ µ(l) \ µ(h) if, and only if, g(l, h, x) ∈ µ0 (l) \ µ0 (h);
  – f (x0 ) is computable in time p2 (|x0 |);
  – g(l, h, x) is computable in time p3 (|l|, |x|) and l can only be accessed by calls to the
    membership oracle MEMl,X .
As in the case of learning algorithms, we consider every call to the oracle as one step
of computation. Notice also that even though g takes h as input, the polynomial time
bound on computing g(l, h, x) does not depend on the size of h as g is only defined for
h polynomial in the size of l.

Theorem 1. Let (X, L, µ) and (X 0 , L, µ0 ) be learning frameworks. If there exists a
polynomial reduction from (X, L, µ) to (X 0 , L, µ0 ) and a polynomial learning algorithm
for (X 0 , L, µ0 ) that uses membership queries and positive bounded equivalence queries
then (X, L, µ) is polynomially exact learnable.

   We use Theorem 1 to prove that DL-Lite∃R and ELlhs TBoxes can be learned in
polynomial time from data retrieval examples. We employ the following result:
Theorem 2 ([12]). The DL-Lite∃R and ELlhs subsumption learning frameworks are
polynomial time exact learnable with membership and positive bounded equivalence
queries.
As the existence of f is guaranteed by the following lemma, in what follows we prove
the existence of g and establish the corresponding time bounds.
Lemma 1. Let L ∈ {DL-Lite∃R , ELlhs } and let C v D be an L concept inclusion. Then
(T , AC ) |= D(ρC ) if, and only if, T |= C v D.

Polynomial Reduction for DL-Lite∃       R TBoxes We show for any target T and hy-
pothesis H polynomial in the size of T that Algorithm 1 transforms every positive
counterexample in polynomial time to a positive counterexample with a singleton ABox
(i.e., of the form {A(a)} or {r(a, b)}). Using the equivalences (T , {A(a)}) |= C(a) iff
T |= A v C and (T , {r(a, b)}) |= C(a) iff T |= ∃r.> v C, we then obtain a positive
subsumption counterexample, so g(l, h, x) is computable in polynomial time.
     Given a positive data retrieval counterexample (A, C(a)), Algorithm 1 exhaustively
applies the role saturation and parent-child merging rules introduced in [12]. We say that
an instance assertion C(a) is role saturated for (T , A) if (T , A) 6|= C 0 (a) whenever
C 0 is the result of replacing a role r by some role s ∈ NR ∩ ΣT with T 6|= r v s and
T |= s v r, where ΣT is the signature of the target TBox T known to the learner.
To define parent/child merging, we identify each ELI concept C with a finite tree TC
whose nodes are labeled with concept names and edges are labeled with roles in the
standard way. For example, if C = ∃t.(A u ∃r.∃r− .∃r.B) u ∃s.> then Fig. 2a illustrates
Algorithm 1 Reducing the positive counterexample
 1: Let C(a) be an instance assertion such that (H, A) 6|= C(a) and (T , A) |= C(a)
 2: function R EDUCE C OUNTER E XAMPLE( A, C(a) )
 3:     Find a role saturated and parent/child merged C(a) (membership queries)
 4:     if C = C0 u ... u Cn then
 5:         Find Ci , 0 ≤ i ≤ n, such that (H, A) 6|= Ci (a)
 6:         C := Ci
 7:     if C = ∃r.C 0 and there is r(a, b) ∈ A such that (T , A) |= C 0 (b) then
 8:         R EDUCE C OUNTER E XAMPLE( A, C 0 (b) )
 9:     else
10:          Find a singleton A0 ⊆ A such that (T , A0 ) |= C(a) but
11:          (H, A0 ) 6|= C(a) (membership queries)
12:          return (A0 ,C(a))


TC . Now, we say that an instance assertion C(a) is parent/child merged for T and A
if for nodes n1 , n2 , n3 in TC such that n2 is an r-successor of n1 , n3 is an s-successor
of n2 and T |= r− ≡ s we have (T , A) 6|= C 0 (a) if C 0 is the concept that results from
identifying n1 and n3 . For instance, the concept in Fig. 2c is the result of identifying the
leaf labeled with B in Fig. 2b with the parent of its parent.
     We present a run of Algorithm 1 for T = {A v ∃s.B, s v r} and H = {s v r}. As-
sume the oracle gives as counterexample (A, C(a)), where A = {t(a, b), A(b), s(a, c)}
and C(a) = ∃t.(A u ∃r.∃r− .∃r.B) u ∃s.>(a) (Fig. 2a). Role saturation produces
C(a) = ∃t.(A u ∃s.∃s− .∃s.B) u ∃s.>(a) (Fig. 2b). Then, applying parent/child merg-
ing twice we obtain C(a) = ∃t.(A u ∃s.B) u ∃s.>(a) (Fig. 2c and 2d).


                           B                          B
         (a)           r            (b)           s        (c)               (d)
                   r                          s                      s
                                                                         B                 B
               r                          s                      s                     s
                   A                          A                      A                     A
          s    t                     s    t                 s    t           s     t

                   Fig. 2: Concept C being role saturated and parent/child merged.

    Since (H, A) 6|= ∃t.(A u ∃s.B)(a), after Lines 4-6, Algorithm 1 updates C by
choosing the conjunct ∃t.(A u ∃s.B). As C is of the form ∃t.C 0 and there is t(a, b) ∈ A
such that (T , A) |= C 0 (b), the algorithm recursively calls the function “ReduceCoun-
terExample” with A u ∃s.B(b). Now, since (H, A) 6|= ∃s.B(b), after Lines 4-6, C
is updated to ∃s.B. Finally, C is of the form ∃t.C 0 and there is no t(b, c) ∈ A such
that (T , A) |= C 0 (c). So the algorithm proceeds to Lines 11-12, where it chooses
A(b) ∈ A. Since (T , {A(b)}) |= ∃s.B(b) and (H, {A(b)}) 6|= ∃s.B(b) we have that
T |= A v ∃s.B and H 6|= A v ∃s.B.

Lemma 2. Let (A, C(a)) be a positive counterexample. Then the following holds:
 1. if C is a basic concept then there is a singleton A0 ⊆ A such that (T , A0 ) |= C(a);
Algorithm 2 Minimizing an ABox A
 1: Let A be an ABox such that (T , A) |= A(a) but (H, A) 6|= A(a), for A ∈ NC , a ∈ Ind(A).
 2: function M INIMIZE AB OX( A)
 3:     Concept saturate A with H
 4:     for every A ∈ NC ∩ ΣT and a ∈ Ind(A) such that
 5:         (T , A) |= A(a) and (H, A) 6|= A(a) do
 6:         Domain Minimize A with A(a)
 7:         Role Minimize A with A(a)
 8:     return (A)


 2. if C is of the form ∃r.C 0 (or ∃r− .C 0 ) and C is role saturated and parent/child
    merged then either there is r(a, b) ∈ A (or r(b, a) ∈ A ) such that (T , A) |= C 0 (b)
    or there is a singleton A0 ⊆ A such that (T , A0 ) |= C(a).

Lemma 3. For any target DL-Lite∃R TBox T and hypothesis DL-Lite∃R TBox H given
a positive data retrieval counterexample (A, C(a)), Algorithm 1 computes in time
polynomial in |T |, |H|, |A| and |C| a counterexample C 0 (b) such that (T , A0 ) |= C 0 (b),
where A0 ⊆ A is a singleton ABox.

Proof. (Sketch) Let (A, C(a)) be the input of “ReduceCounterExample”. The number
of membership queries in Line 3 is polynomial in |C| and |T |. If C has more than
one conjunct then it is updated in Lines 4-6, so C becomes either (1) a basic concept
or (2) of the form ∃r.C 0 (or ∃r− .C 0 ). By Lemma 2 in case (1) there is a singleton
A0 ⊆ A such that (T , A0 ) |= C(a), computed by Line 11 of Algorithm 1. In case (2)
either there is a singleton A0 ⊆ A such that (T , A0 ) |= C(a), computed by Line 11 of
Algorithm 1, or we obtain a counterexample with a refined C. Since the size of the refined
counterexample is strictly smaller after every recursive call of “ReduceCounterExample”,
the total number of calls is bounded by |C|.                                            o
Using Theorem 2 and Theorem 1 we obtain:
Theorem 3. The DL-Lite∃R data retrieval framework is polynomially exact learnable.

Polynomial Reduction for ELlhs TBoxes In this section we give a polynomial algo-
rithm computing g for ELlhs . First we note that the concept assertion in data retrieval
counterexamples (A, D(a)) can always be made atomic. Let ΣT be the signature of the
target TBox T .
Lemma 4. If (A, D(a)) is a positive counterexample then by posing polynomially many
membership queries one can find a concept name A ∈ ΣT and an individual b ∈ Ind(A)
such that (A, A(b)) is also a counterexample.
Thus it suffices to show that given a positive counterexample (A, D(a)) with D ∈
NC , one can compute an EL concept expression C bounded in size by |T | such that
(T , {C(b)}) |= A(b) and (H, {C(b)}) 6|= A(b), where A ∈ NC . As (T , {C(b)}) |=
A(b) if and only if T |= C v A, we obtain a positive subsumption counterexample.
Our algorithm for computing g is based on two operations: minimization, computed by
Algorithm 3 Computing a tree shaped ABox
 1: function F IND T REE( A)
 2:    M INIMIZE AB OX( A)
 3:    while there is a cycle c in A do
 4:        Unfold a ∈ Ind(A) in cycle c
 5:        M INIMIZE AB OX( A)
 6:    Let C be the concept expression corresponding to A with counterexample A(a).
 7:    return (C(a),A(a))


Algorithm 2, and unfolding. Algorithm 2 minimizes a given ABox with the following
rules.
    (Concept saturate A with H) If A(a) ∈ / A and (H, A) |= A(a) then replace A by
A ∪ {A(a)}, where A ∈ NC ∩ ΣT and a ∈ Ind(A).
    (Domain Minimize A with A(a)) If A(a) is a counterexample and (T , A−b ) |= A(a)
then replace A by A−b , where A−b is the result of removing from A all ABox assertions
in which b occurs.
    (Role Minimize A with A(a)) If A(a) is a counterexample and (T , A−r(b,c) ) |=
A(a) then replace A by A−r(b,c) , where A−r(b,c) be obtained by removing a role
assertion r(b, c) from A.
Lemma 5. Given a positive counterexample (A, D(a)) with D ∈ NC , Algorithm 2
computes in polynomially many steps with respect to |A|, |H|, and |T | an ABox A0 such
that |Ind(A0 )| ≤ |T | and (A0 , A(b)) is a positive counterexample, for some A ∈ NC and
b ∈ Ind(A0 ).
It remains to show that A can be made tree-shaped. We say that A has an (undirected)
cycle if there is a finite sequence a0 · r1 · a1 · ... · rk · ak such that (i) a0 = ak and (ii) there
are mutually distinct assertions of the form ri+1 (ai , ai+1 ) or ri+1 (ai+1 , ai ) in A, for
0 ≤ i < k. The unfolding of a cycle c = a0 ·r1 ·a1 ·...·rk ·ak in a given ABox A is obtained
by replacing c by the cycle c0 = a0 ·r1 ·a1 ·...·rk ·ak−1 ·rk ·b      a0 ·r1 · · · b
                                                                                   ak−1 ·rk ·a0 , where
ai are fresh individual names, 0 ≤ i ≤ k −1, in such a way that (i) if r(ai , d) ∈ A, for an
b
individual d not in the cycle, then r(b  ai , d) ∈ A; and (ii) if A(ai ) ∈ A then A(b        ai ) ∈ A.
     We prove in the full version that after every unfolding-minimisation step in Algo-
rithm 3 the ABox A on the one hand becomes strictly larger and on the other does not
exceed the size of the target TBox T . Thus Algorithm 3 terminates after a polynomial
number of steps yielding a tree-shaped counterexample.
Lemma 6. Algorithm 3 computes a minimal tree shaped ABox A with size polynomial
in |T | and runs in polynomially many steps in |T | and |A|.
Using Theorem 2 and Theorem 1 we obtain:
Theorem 4. The ELlhs data retrieval framework is polynomially exact learnable.

4    Limits of Polynomial Time Learnability
Our proof of non-polynomial learnability of general EL TBoxes from data retrieval
examples extends previous results on non-polynomial learnability of EL TBoxes from
subsumptions [12]. We start by giving a brief overview of the construction in [12], show
that it fails in the data retrieval setting and then demonstrate how it can be modified.
     The non-learnability proof in [12] proceeds as follows. A learner tries to exactly
identify one of the possible target TBoxes {TL | L ∈ Ln }, for a superpolynomial in
n set Ln defined below. At every stage of computation the oracle maintains a set of
TBoxes S, which the learner is unable to distinguish based on the answers given so far.
Initially S = {TL | L ∈ Ln }. It has been proved that for any EL inclusion C v D either
TL |= C v D for every L ∈ Ln or the number of L ∈ Ln such that TL |= C v D
does not exceed |C|. When a polynomial learner asks a membership query C v D the
oracle answers ‘yes’ if TL |= C v D for every L ∈ Ln and ‘no’ otherwise. In the
latter case the oracle removes polynomially many TL such that TL |= C v D from S.
Similarly, for any equivalence query with hypothesis H asked by a polynomial learning
algorithm there exists a polynomial size inclusion C v D, which can be returned as a
counterexample and that excludes only polynomially many TBoxes from S. Thus, every
query to the oracle reduces the size of S at most polynomially in n, but then the learner
cannot distinguish between the remaining TBoxes of our initial superpolynomial set S.
     The set of indices Ln and the target TBoxes TL are defined as follows. Fix two
role names r and s. An n-tuple L is a sequence of role sequences (σ1 , . . . , σn ), where
every σi is a sequence of role names r and s, that is σi = σi1 σi2 . . . σin with σij ∈ {r, s}.
Then Ln is a set of n-tuples such that for every L, L0 ∈ Ln with L = (σ1 , . . . , σn ),
L0 = (σ10 , . . . , σn0 ), if σi = σj0 then L = L0 and i = j. There are N = b2n /nc different
tuples in Ln . For every n > 0 and every n-tuple L = (σ1 , . . . , σn ) we define an acyclic
EL TBox TL as the union of T0 = {Xi v ∃r.Xi+1 u ∃s.Xi+1 | 0 ≤ i < n} and the
following inclusions:

                     A1 v ∃σ1 .M u X0                 An v ∃σn .M u X0
                                               ...
                     B1 v ∃σ1 .M u X0                 Bn v ∃σn .M u X0
                             A ≡ X0 u ∃σ1 .M u · · · u ∃σn .M.

where the expression ∃σ.C stands for ∃σ 1 .∃σ 2 . . . ∃σ n .C, M is a concept name that
‘marks’ a designated path given by σ and T0 generates a full binary tree whose edges are
 labelled with the role names r and s and with X0 at the root, X1 at level 1 and so on.
     In contrast to the subsumption framework, every TL can be exactly identified using
 data retrieval queries. For example, as X0 u ∃σ1 .M u · · · u ∃σn .M v A ∈ TL , a
 learning from data retrieval queries algorithm can learn all the sequences in the n-
 tuple L = (σ1 , . . . , σn ), by defining an ABox A = {X0 (a1 ), r(a1 , a2 ), s(a1 , a2 ), . . . ,
 r(an−1 , an ), s(an−1 , an ), M (an )} and then proceeding with unfolding and minimizing
 A via membership queries of the form (TL , A) |= A(a1 ).
     To show the non-tractability for data retrieval queries, we first modify S in such a
way that the concept expression which ‘marks’ the sequences in L = (σ1 , . . . , σn ) is
 now given by the set Bn of all conjunctions F1 u · · · u Fn , where Fi ∈ {Ei , Ēi }, for
1 ≤ i ≤ n. Intuitively, every member of Bn encodes a binary string of length n with Ei
 encoding 1 and Ēi encoding 0. For every L ∈ Ln and every B ∈ Bn we define TLB as
 the union of T0 and the concept inclusions defined above with B replacing M .
     Then for any sequence σ of length n there exists at most one L ∈ Ln , at most
 one 1 ≤ i ≤ n and at most one B ∈ Bn such that TLB |= Ai v ∃σ.B and TLB |=
Bi v ∃σ.B. Notice that the size of each TLB is polynomial in n and so Ln contains
                                                       B
                                          dn of each TL , with L ∈ Ln and B ∈ Bn .
superpolynomially many n-tuples in the size
         B
Every TL entails, among other inclusions, i=1 Ci v A, where every Ci is either Ai or
Bi . Let Σn be the signature of the TBoxes TLB and consider a TBox T ∗ defined as the
following set of concept inclusions:

                ∃r.(E1 u Ē1 ) v (E1 u Ē1 ), (E1 u Ē1 ) v ∃r.(E1 u Ē1 ),
                ∃s.(E1 u Ē1 ) v (E1 u Ē1 ), (E1 u Ē1 ) v ∃s.(E1 u Ē1 ),
             (Ei u Ēi ) v A for every 1 ≤ i ≤ n and every A ∈ Σn ∩ NC
    The basic idea of extending our TBoxes with T ∗ is that if a ∈ (Ei u Ēi )IA , for
an ABox A and individual a ∈ Ind(A), then for all L ∈ Ln and B ∈ Bn , we have
(TLB , A) |= D(b), where D is any EL concept expression over Σn and b ∈ Ind(A) is
any successor or predecessor of a (or a itself). This means that for each individual in
A at most one B of the 2n binary strings in Bn can be distinguished by data retrieval
queries. The following lemma enables us to respond to membership queries without
eliminating too many L ∈ Ln and B ∈ Bn used to encode TLB in the set of TBoxes that
the learner cannot distinguish.
Lemma 7. For any ABox A, any EL concept assertion D(a) over Σn , and any a ∈
Ind(A), if there is L ∈ Ln and B ∈ Bn such that (TLB ∪ T ∗ , A) |= D(a) then:
    – either (TLB ∪ T ∗ , A) |= D(a), for every L ∈ Ln and B ∈ Bn , or
    – (TLB ∪ T ∗ , A) |= D(a) for at most |D| elements L ∈ Ln , or
    – (TLB ∪ T ∗ , A) |= D(a) for at most |A| elements B ∈ Bn .
   The next lemma is immediate from Lemma 15 presented in [12]. It shows how the
oracle can answer equivalence queries eliminating at most one L ∈ Ln used to encode
TLB in the set S of TBoxes that the learner cannot distinguish.
Lemma 8. For any n > 1 and any EL TBox H in Σn with |H| < 2n , there exists an
ABox A, an individual a ∈ Ind(A) and an EL concept expression D over Σn such that
(i) the size of A plus the size of D does not exceed 6n and (ii) if (H, A) |= D(a) then
(TLB , A) |= D(a) for at most one L ∈ Ln and if (H, A) 6|= D(a) then for every L ∈ Ln
we have (TLB ∪ T ∗ , A) |= D(a).
    Then, by Lemmas 7 and 8, we have that: (i) any polynomial size membership query
can distinguish at most polynomially many TBoxes from S; and (ii) if the learner’s
hypothesis is polynomial size then there exists a polynomial size counterexample that
the oracle can give which distinguishes at most polynomially many TBoxes from S.
Theorem 5. The EL data retrieval framework is not polynomially exact learnable.

5     Future Work
We plan to consider an extension of the learning protocol in which arbitrary conjunctive
queries are admitted in queries to the domain expert/oracle. We then still have polynomial
time learnability for ELlhs but conjecture non-polynomial time learnability for DL-Lite∃R .
Another extension is exact learnability for the Horn-extension of DL-Lite∃R for which
we conjecture that polynomial time learnability still holds.
                                 Bibliography


 [1] D. Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1987.
 [2] F. Baader, S. Brandt, and C. Lutz. Pushing the EL envelope. In IJCAI, pages
     364–369. Professional Book Center, 2005.
 [3] F. Baader, D. Calvanese, D. McGuiness, D. Nardi, and P. Patel-Schneider. The De-
     scription Logic Handbook: Theory, implementation and applications. Cambridge
     University Press, 2003.
 [4] S. Bechhofer, I. Horrocks, C. Goble, and R. Stevens. Oiled: a reason-able ontology
     editor for the semantic web. In KI 2001: Advances in Artificial Intelligence, pages
     396–408. Springer, 2001.
 [5] D. Borchmann and F. Distel. Mining of EL-GCIs. In The 11th IEEE International
     Conference on Data Mining Workshops, Vancouver, Canada, 11 December 2011.
     IEEE Computer Society.
 [6] P. Buitelaar, P. Cimiano, and B. Magnini, editors. Ontology Learning from Text:
     Methods, Evaluation and Applications. IOS Press, 2005.
 [7] D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable
     reasoning and efficient query answering in description logics: The DL-Lite family.
     Journal of Automated reasoning, 39(3):385–429, 2007.
 [8] P. Cimiano, A. Hotho, and S. Staab. Learning concept hierarchies from text corpora
     using formal concept analysis. J. Artif. Intell. Res. (JAIR), 24:305–339, 2005.
 [9] J. Day-Richter, M. A. Harris, M. Haendel, S. Lewis, et al. Obo-edit an ontology
     editor for biologists. Bioinformatics, 23(16):2198–2200, 2007.
[10] M. Frazier and L. Pitt. Learning From Entailment: An Application to Propositional
     Horn Sentences. In ICML, pages 120–127, 1993.
[11] B. Konev, M. Ludwig, D. Walther, and F. Wolter. The logical difference for the
     lightweight description logic EL. J. Artif. Intell. Res. (JAIR), 44:633–708, 2012.
[12] B. Konev, C. Lutz, A. Ozaki, and F. Wolter. Exact learning of lightweight descrip-
     tion logic ontologies. In Principles of Knowledge Representation and Reasoning:
     Proceedings of the Fourteenth International Conference, KR 2014, Vienna, Austria,
     July 20-24, 2014, 2014.
[13] C. Lutz, R. Piro, and F. Wolter. Description logic TBoxes: Model-theoretic charac-
     terizations and rewritability. In IJCAI, pages 983–988, 2011.
[14] Y. Ma and F. Distel. Learning formal definitions for snomed CT from text. In
     Artificial Intelligence in Medicine - 14th Conference on Artificial Intelligence in
     Medicine, AIME 2013, Murcia, Spain, May 29 - June 1, 2013. Proceedings, pages
     73–77, 2013.
[15] M. A. Musen. Protégé ontology editor. Encyclopedia of Systems Biology, pages
     1763–1765, 2013.
[16] C. Reddy and P. Tadepalli. Learning First-Order Acyclic Horn Programs from
     Entailment. In in Proceedings of the 15th International Conference on Machine
     Learning; (and Proceedings of the 8th International Conference on Inductive Logic
     Programming, pages 23–37. Morgan Kaufmann, 1998.
[17] H. Stuckenschmidt, C. Parent, and S. Spaccapietra, editors. Modular Ontologies:
     Concepts, Theories and Techniques for Knowledge Modularization, volume 5445
     of Lecture Notes in Computer Science. Springer, 2009.
[18] H. Wang, M. Horridge, A. Rector, N. Drummond, and J. Seidenberg. Debugging
     OWL-DL ontologies: A heuristic approach. In The Semantic Web–ISWC 2005,
     pages 745–757. Springer, 2005.