Exact Learning Description Logic Ontologies from Data Retrieval Examples Boris Konev, Ana Ozaki, and Frank Wolter Department of Computer Science, University of Liverpool, UK Abstract. We investigate the complexity of learning description logic ontologies in Angluin et al.’s framework of exact learning via queries posed to an oracle. We consider membership queries of the form “is individual a a certain answer to a data retrieval query q in a given ABox and the unkown target TBox?” and equivalence queries of the form “is a given TBox equivalent to the unknown target TBox?”. We show that (i) DL-Lite TBoxes with role inclusions and ELI concept expressions on the right-hand side of inclusions and (ii) EL TBoxes without complex concept expressions on the right-hand side of inclusions can be learned in polynomial time. Both results are proved by a non-trivial reduction to learning from subsumption examples. We also show that arbitrary EL TBoxes cannot be learned in polynomial time. 1 Introduction Building an ontology is prone to errors, time consuming, and costly. The research com- munities has addressed this problem in many different ways, for example, by supplying tool support for editing ontologies [15, 4, 9], developing reasoning support for debug- ging ontologies [18], supporting modular ontology design [17], and by investigating automated ontology generation from data or text [8, 6, 5, 14]. One major problem when building an ontology is the fact that domain experts are rarely ontology engineering ex- perts and that, conversely, ontology engineers are typically not familiar with the domain of the ontology. An ontology building project therefore often relies on the successful communication between an ontology engineer (familiar with the semantics of ontology languages) and a domain expert (familiar with the domain of interest). In this paper, we consider a simple model of this communication process and analyse, within this model, the computational complexity of reaching a correct domain ontology. We assume that – the domain expert knows the domain ontology and its vocabulary without being able to formalize or communicate this ontology; – the domain expert is able to communicate the vocabulary of the ontology and shares it with the the ontology engineer. Thus, the domain expert and ontology engineer have a common understanding of the vocabulary of the ontology. The ontology engineer knows nothing else about the domain. – the ontology engineer can pose queries to the domain expert which the domain expert answers truthfully. Assuming that the domain expert can interpret data in her area of expertise, the main queries posed by the ontology engineer are based on instance retrieval examples: • assume a data instance A and a query q(x) are given. Is the individual a a certain answer to query q(x) in A and the ontology O? In addition, we require a way for the ontology engineer to find out whether she has reconstructed the target ontology already and, if this is not the case, to request an example illustrating the incompleteness of the reconstruction. We abstract from defining a communication protocol for this, but assume for simplicity that the following query can be posed by the ontology engineer: • Is this ontology H complete? If not, return a data instance A, a query q(x), and an individual a such that a is a certain answer to q(x) in A and the ontology O and it is not a certain answer to q(x) in A and the ontology H. Given this model, our question is whether the ontology engineer can learn the target ontology O and which computational resources are required for this depending on the ontology language in which the ontology O and the hypothesis ontologies H are formu- lated. Our model obviously abstracts from a number of fundamental problems in building ontologies and communicating about them. In particular, it makes the assumption that the domain expert knows the domain ontology and its vocabulary (without being able to formalize it) despite the fact that finding an appropriate vocabulary for a domain of interest is a major problem in ontology design [8]. We make this assumption here in order to isolate the problem of communication about the logical relationships between known vocabulary items and its dependence on the ontology language within which the relationships can be formulated. The model described above is an instance of Angluin et al.’s framework of exact learning via queries to an oracle [1]. The queries using instance retrieval examples can be regarded as membership queries posed by a learner to an oracle and the completeness query based on a hypothesis H can be regarded as an equivalence query by the learner to the oracle. Formulated in Angluin’s terms we are thus interested in whether there exists a deterministic learning algorithm that poses membership and equivalence queries of the above form to an oracle and that learns an arbitrary ontology over a given ontology language in polynomial time. We consider polynomial learnability in three distinct DLs: we show that DL-Lite ontologies with role inclusions and arbitrary ELI concepts on the right-hand side of concept inclusions can be learned in polynomial time if database queries in instance retrieval examples are ELI instance queries (or, equivalently, acyclic conjunctive queries). We call this DL DL-Lite∃R and note that it is the core of the web ontology language profile OWL2 QL. We also note that without complex ELI concepts on the right-hand side of concept inclusions, polynomial learnability would be trivial as only finitely many non-equivalent such TBoxes exist over a given vocabulary of concept and role names. The second DL we consider is EL which is the logic underpinning the web ontology language profile OWL2 EL. We show that EL TBoxes cannot be learned in polynomial time using the protocol above if the database queries in instance retrieval examples are EL instance queries. We then consider the fragment ELlhs of EL without complex concepts on the right-hand side of concept inclusions and prove that it can be learned in polynomial time using the above protocol with instance retrieval examples. The proofs of the positive learning results are by reduction to polynomial time learnability results for DL-Lite∃R and ELlhs for the case in which concept subsumptions rather than instance retrieval examples are used in the communication between the learner and the oracle [12]. Our move from concept subsumptions to data retrieval examples is motivated by the observation that domain experts are often more familiar with querying data in their domain than with the logical notion of subsumption between complex concepts. Detailed proofs are provided at http://csc.liv.ac.uk/˜frank/publ/publ.html. 2 Preliminaries Let NC and NR be countably infinite sets of concept and role names, respectively. The dialect DL-Lite∃R of DL-Lite is defined as follows [7]. A role is a role name or an inverse role r− with r ∈ NR . A role inclusion (RI) is of the form r v s, where r and s are roles. A basic concept is either a concept name or of the form ∃r.>, with r a role. A DL-Lite∃R concept inclusion (CI) is of the form B v C, where B is a basic concept expression and C is an ELI concept expression, that is, C is formed according to the rule C, D := A | > | C u D | ∃r.C | ∃r− .C where A ranges over NC and r ranges over NR . A DL-Lite∃R TBox is a finite set of DL-Lite∃R CIs and RIs. As usual, an EL concept expression is an ELI concept expression that does not use inverse roles, an EL concept inclusion has the form C v D with C and D EL concept expressions, and a (general) EL TBox is a finite set of EL concept inclusions [2]. We also consider the restriction ELlhs of general EL TBoxes where only concept names are allowed on the right-hand side of concept inclusions. The size of a concept expression C, denoted with |C|, is the length of the string that represents it, where concept names and role names are considered to be of length one. A TBox signature is the set of concept P and role names occurring in the TBox. The size of a TBox T , denoted with |T |, is CvD∈T |C| + |D|. Let NI be a countably infinite set of individual names. An ABox A is a finite non- empty set containing concept name assertions A(a) and role assertions r(a, b), where a, b are individuals in NI , A is a concept name and r is a role. Ind(A) denotes the set of individuals that occur in A. A is a singleton ABox if it contains only one ABox assertion. Assertions of the form C(a) and r(a, b), where a, b ∈ NI , C an ELI concept expression, and r ∈ NR , are called instance assertions. Note that instance assertions of the form C(a) with C not a concept name nor C = > do not occur in ABoxes. The semantics of description logic is defined as usual [3]. We write I |= α to say that an inclusion or assertion α is true in I. An interpretation I is a model of a KB (T , A) if I |= α for all α ∈ T ∪ A. (T , A) |= α means that I |= α for all models I of (T , A). A learning framework F is a triple (X, L, µ), where X is a set of examples (also called domain or instance space), L is a set of learning concepts1 and µ is a mapping from L to 2X . The subsumption learning framework FS , studied in [12], is defined as (XS , L, µS ), where L is the set of all TBoxes that are formulated in a given DL; XS is the set of subsumption examples of the form C v D, where C, D are concept expressions of the DL under consideration; and µS (T ) is defined as {C v D ∈ XS | T |= C v D}, for every T ∈ L. It should be clear that µS (T ) = µS (T 0 ) if, and only if, the TBoxes T and T 0 entail the same set of inclusions, that is, they are logically equivalent. 1 In the learning literature (e.g., [1]), the term ‘learning concept’ is often defined as a set of examples. We do not distinguish between learning concepts and their representations and only consider representable learning concepts to emphasize on the task of identifying a TBox that is logically equivalent to the target TBox. We study the data retrieval learning framework FD defined as (XD , L, µD ), where L is same as in FS ; X is the set of data retrieval examples of the form (A, D(a)), where A is an ABox, D(a) is a concept assertion of the DL under consideration, and a ∈ Ind(A); and µ(T ) = {(A, D(a)) ∈ XD | (T , A) |= D(a)}. As in the case of learning from subsumptions, µS (T ) = µS (T 0 ) if, and only if, the TBoxes T and T 0 are logically equivalent. Given a learning framework F = (X, L, µ), we are interested in the exact identi- fication of a target learning concept l ∈ L by posing queries to oracles. Let MEMl,X be the oracle that takes as input some x ∈ X and returns ‘yes’ if x ∈ µ(l) and ‘no’ otherwise. We say that x is a positive example for l if x ∈ µ(l) and a negative example for l if x 6∈ µ(l). Then a membership query is a call to the oracle MEMl,X . Similarly, for every l ∈ L, we denote by EQl,X the oracle that takes as input a hypothesis learning concept h ∈ L and returns ‘yes’, if µ(h) = µ(l), or a counterexample x ∈ µ(h) ⊕ µ(l) otherwise, where ⊕ denotes the symmetric set difference. An equivalence query is a call to the oracle EQl,X . We say that a learning framework (X, L, µ) is exact learnable if there is an algorithm A such that for any target l ∈ L the algorithm A always halts and outputs l0 ∈ L such that µ(l) = µ(l0 ) using membership and equivalence queries answered by the oracles MEMl,X and EQl,X , respectively. A learning framework (X, L, µ) is polynomially exact learnable if it is exact learnable by an algorithm A such that at every step2 of computation the time used by A up to that step is bounded by a polynomial p(|l|, |x|), where l is the target and x ∈ X is the largest counterexample seen so far3 . As argued in the introduction, for learning subsumption and data retrieval learning frameworks we additionally assume that the signature of the target TBox is always known to the learner. An important class of learning algorithms—in particular, all algorithms presented in [12, 10, 16] fit in this class—always make equivalence queries with hypotheses h which are polynomial in the size of l and such that µ(h) ⊆ µ(l), so that counterexamples returned by the EQl,X oracles are always positive. We say that such algorithms use positive bounded equivalence queries. 3 Polynomial Time Learnability In this section we prove polynomial time exact learnability of the DL-Lite∃R and ELlhs data retrieval learning frameworks. These frameworks are instances of the general definition given above, where the concept expression D in a data retrieval example (A, D(a)) is an ELI concept expression in the DL-Lite∃R framework and an EL concept expression in the ELlhs framework, respectively. The proof is by reduction to learning from subsumptions. We illustrate its idea for ELlhs . To learn a TBox from data retrieval examples we run a learning from subsumptions algorithm as a ‘black box’. Every time the learning from subsumptions algorithm makes a membership or an equivalence query we rewrite the query into the data setting and pass it on to the data retrieval oracle. The oracle’s answer, rewritten back to the subsumption 2 We count each call to an oracle as one step of computation. 3 We assume some natural notion of a length of an example x and a learning concept l, denoted |x| and |l|, respectively. A A A A A A A A r .. s r .. s r .. s r .. s . . . . r,s r s r s A A A r A s Fig. 1: An ABox A = {r(a, a), s(a, a), A(a)} and its unravelling up to level n. setting, is given to the learning from subsumptions algorithm. When the learning from subsumptions algorithm terminates we return the learnt TBox. This reduction is made possible by the close relationship between data retrieval and subsumption examples. For every TBox T and inclusions C v D, one can interpret a concept expression C as a labelled tree and encode this tree as an ABox AC with root ρC such that T |= C v D iff (T , AC ) |= D(ρC ). Then, membership queries in the subsumption setting can be answered with the help of a data retrieval oracle due to the relation between subsumptions and instance queries described above. An inclusion C v D is a (positive) subsumption example for some target TBox T if, and only if, (AC , D(ρC )) is a (positive) data retrieval example for the same target T . To handle equivalence queries, we need to be able to rewrite data retrieval counterexamples returned by the data retrieval oracle into the subsumption setting. For every TBox T and data retrieval query (A, D(a)) one can construct a concept expression CA such that (T , A) |= D(a) iff T |= CA v D. Such a concept expression CA can be obtained by unravelling A into a tree-shaped ABox and representing it as a concept expression. This unravelling, however, can increase the ABox size exponentially. Thus, to obtain a polynomial bound on the running time of the learning process, CA v D cannot be simply returned as an answer to a subsumption equivalence query. For example, for a target TBox T = {∃rn .A v B} and a hypothesis H = ∅ the data retrieval query (A, B(a)), where A = {r(a, a), s(a, a), A(a)}, is a positive counterexample. The tree-shaped unravelling of A up to level n is a full binary tree of depth n, as shown in Fig. 1. On the other hand, the non-equivalence of T and H can already be witnessed by (A0 , B(a)), where A0 = {r(a, a), A(a)}. The unravelling of A0 up to level n produces a linear size ABox {r(a, a2 ), r(a2 , a3 ), . . . , r(an−1 , an ), A(a), A(a2 ), . . . , A(an )}, corresponding to the left-most path in Fig. 1, which, in turn, is linear-size w.r.t. the target inclusion ∃rn .A v B. Notice that A0 is obtained from A by removing the s(a, a) edge and checking, using membership queries, whether (T , A0 ) |= q still holds. In other words, one might need to ask further membership queries in order to rewrite answers to data retrieval equivalence queries given by the data retrieval oracle into the subsumption setting. We address the need of rewriting counterexamples by introducing an abstract notion of reduction between different exact learning frameworks. To simplify notation, we assume that both learning frameworks use the same set of learning concepts L and only consider positive bounded equivalence queries. This definition of reduction can be easily extended to arbitrary learning frameworks and arbitrary queries. We say that a learning framework F = (X, L, µ) polynomially reduces to F0 = (X , L, µ0 ) if for some polynomials p1 (·), p2 (·) and p3 (·, ·) there exist a function f : 0 X 0 → X and a partial function g : L × L × X → X 0 , defined for every (l, h, x) such that |h| = p1 (|l|), µ(h) ⊆ µ(l) and x ∈ X, for which the following conditions hold. – For all x0 ∈ X 0 we have x0 ∈ µ0 (l) if, and only if, f (x0 ) ∈ µ(l); – For all x ∈ X we have x ∈ µ(l) \ µ(h) if, and only if, g(l, h, x) ∈ µ0 (l) \ µ0 (h); – f (x0 ) is computable in time p2 (|x0 |); – g(l, h, x) is computable in time p3 (|l|, |x|) and l can only be accessed by calls to the membership oracle MEMl,X . As in the case of learning algorithms, we consider every call to the oracle as one step of computation. Notice also that even though g takes h as input, the polynomial time bound on computing g(l, h, x) does not depend on the size of h as g is only defined for h polynomial in the size of l. Theorem 1. Let (X, L, µ) and (X 0 , L, µ0 ) be learning frameworks. If there exists a polynomial reduction from (X, L, µ) to (X 0 , L, µ0 ) and a polynomial learning algorithm for (X 0 , L, µ0 ) that uses membership queries and positive bounded equivalence queries then (X, L, µ) is polynomially exact learnable. We use Theorem 1 to prove that DL-Lite∃R and ELlhs TBoxes can be learned in polynomial time from data retrieval examples. We employ the following result: Theorem 2 ([12]). The DL-Lite∃R and ELlhs subsumption learning frameworks are polynomial time exact learnable with membership and positive bounded equivalence queries. As the existence of f is guaranteed by the following lemma, in what follows we prove the existence of g and establish the corresponding time bounds. Lemma 1. Let L ∈ {DL-Lite∃R , ELlhs } and let C v D be an L concept inclusion. Then (T , AC ) |= D(ρC ) if, and only if, T |= C v D. Polynomial Reduction for DL-Lite∃ R TBoxes We show for any target T and hy- pothesis H polynomial in the size of T that Algorithm 1 transforms every positive counterexample in polynomial time to a positive counterexample with a singleton ABox (i.e., of the form {A(a)} or {r(a, b)}). Using the equivalences (T , {A(a)}) |= C(a) iff T |= A v C and (T , {r(a, b)}) |= C(a) iff T |= ∃r.> v C, we then obtain a positive subsumption counterexample, so g(l, h, x) is computable in polynomial time. Given a positive data retrieval counterexample (A, C(a)), Algorithm 1 exhaustively applies the role saturation and parent-child merging rules introduced in [12]. We say that an instance assertion C(a) is role saturated for (T , A) if (T , A) 6|= C 0 (a) whenever C 0 is the result of replacing a role r by some role s ∈ NR ∩ ΣT with T 6|= r v s and T |= s v r, where ΣT is the signature of the target TBox T known to the learner. To define parent/child merging, we identify each ELI concept C with a finite tree TC whose nodes are labeled with concept names and edges are labeled with roles in the standard way. For example, if C = ∃t.(A u ∃r.∃r− .∃r.B) u ∃s.> then Fig. 2a illustrates Algorithm 1 Reducing the positive counterexample 1: Let C(a) be an instance assertion such that (H, A) 6|= C(a) and (T , A) |= C(a) 2: function R EDUCE C OUNTER E XAMPLE( A, C(a) ) 3: Find a role saturated and parent/child merged C(a) (membership queries) 4: if C = C0 u ... u Cn then 5: Find Ci , 0 ≤ i ≤ n, such that (H, A) 6|= Ci (a) 6: C := Ci 7: if C = ∃r.C 0 and there is r(a, b) ∈ A such that (T , A) |= C 0 (b) then 8: R EDUCE C OUNTER E XAMPLE( A, C 0 (b) ) 9: else 10: Find a singleton A0 ⊆ A such that (T , A0 ) |= C(a) but 11: (H, A0 ) 6|= C(a) (membership queries) 12: return (A0 ,C(a)) TC . Now, we say that an instance assertion C(a) is parent/child merged for T and A if for nodes n1 , n2 , n3 in TC such that n2 is an r-successor of n1 , n3 is an s-successor of n2 and T |= r− ≡ s we have (T , A) 6|= C 0 (a) if C 0 is the concept that results from identifying n1 and n3 . For instance, the concept in Fig. 2c is the result of identifying the leaf labeled with B in Fig. 2b with the parent of its parent. We present a run of Algorithm 1 for T = {A v ∃s.B, s v r} and H = {s v r}. As- sume the oracle gives as counterexample (A, C(a)), where A = {t(a, b), A(b), s(a, c)} and C(a) = ∃t.(A u ∃r.∃r− .∃r.B) u ∃s.>(a) (Fig. 2a). Role saturation produces C(a) = ∃t.(A u ∃s.∃s− .∃s.B) u ∃s.>(a) (Fig. 2b). Then, applying parent/child merg- ing twice we obtain C(a) = ∃t.(A u ∃s.B) u ∃s.>(a) (Fig. 2c and 2d). B B (a) r (b) s (c) (d) r s s B B r s s s A A A A s t s t s t s t Fig. 2: Concept C being role saturated and parent/child merged. Since (H, A) 6|= ∃t.(A u ∃s.B)(a), after Lines 4-6, Algorithm 1 updates C by choosing the conjunct ∃t.(A u ∃s.B). As C is of the form ∃t.C 0 and there is t(a, b) ∈ A such that (T , A) |= C 0 (b), the algorithm recursively calls the function “ReduceCoun- terExample” with A u ∃s.B(b). Now, since (H, A) 6|= ∃s.B(b), after Lines 4-6, C is updated to ∃s.B. Finally, C is of the form ∃t.C 0 and there is no t(b, c) ∈ A such that (T , A) |= C 0 (c). So the algorithm proceeds to Lines 11-12, where it chooses A(b) ∈ A. Since (T , {A(b)}) |= ∃s.B(b) and (H, {A(b)}) 6|= ∃s.B(b) we have that T |= A v ∃s.B and H 6|= A v ∃s.B. Lemma 2. Let (A, C(a)) be a positive counterexample. Then the following holds: 1. if C is a basic concept then there is a singleton A0 ⊆ A such that (T , A0 ) |= C(a); Algorithm 2 Minimizing an ABox A 1: Let A be an ABox such that (T , A) |= A(a) but (H, A) 6|= A(a), for A ∈ NC , a ∈ Ind(A). 2: function M INIMIZE AB OX( A) 3: Concept saturate A with H 4: for every A ∈ NC ∩ ΣT and a ∈ Ind(A) such that 5: (T , A) |= A(a) and (H, A) 6|= A(a) do 6: Domain Minimize A with A(a) 7: Role Minimize A with A(a) 8: return (A) 2. if C is of the form ∃r.C 0 (or ∃r− .C 0 ) and C is role saturated and parent/child merged then either there is r(a, b) ∈ A (or r(b, a) ∈ A ) such that (T , A) |= C 0 (b) or there is a singleton A0 ⊆ A such that (T , A0 ) |= C(a). Lemma 3. For any target DL-Lite∃R TBox T and hypothesis DL-Lite∃R TBox H given a positive data retrieval counterexample (A, C(a)), Algorithm 1 computes in time polynomial in |T |, |H|, |A| and |C| a counterexample C 0 (b) such that (T , A0 ) |= C 0 (b), where A0 ⊆ A is a singleton ABox. Proof. (Sketch) Let (A, C(a)) be the input of “ReduceCounterExample”. The number of membership queries in Line 3 is polynomial in |C| and |T |. If C has more than one conjunct then it is updated in Lines 4-6, so C becomes either (1) a basic concept or (2) of the form ∃r.C 0 (or ∃r− .C 0 ). By Lemma 2 in case (1) there is a singleton A0 ⊆ A such that (T , A0 ) |= C(a), computed by Line 11 of Algorithm 1. In case (2) either there is a singleton A0 ⊆ A such that (T , A0 ) |= C(a), computed by Line 11 of Algorithm 1, or we obtain a counterexample with a refined C. Since the size of the refined counterexample is strictly smaller after every recursive call of “ReduceCounterExample”, the total number of calls is bounded by |C|. o Using Theorem 2 and Theorem 1 we obtain: Theorem 3. The DL-Lite∃R data retrieval framework is polynomially exact learnable. Polynomial Reduction for ELlhs TBoxes In this section we give a polynomial algo- rithm computing g for ELlhs . First we note that the concept assertion in data retrieval counterexamples (A, D(a)) can always be made atomic. Let ΣT be the signature of the target TBox T . Lemma 4. If (A, D(a)) is a positive counterexample then by posing polynomially many membership queries one can find a concept name A ∈ ΣT and an individual b ∈ Ind(A) such that (A, A(b)) is also a counterexample. Thus it suffices to show that given a positive counterexample (A, D(a)) with D ∈ NC , one can compute an EL concept expression C bounded in size by |T | such that (T , {C(b)}) |= A(b) and (H, {C(b)}) 6|= A(b), where A ∈ NC . As (T , {C(b)}) |= A(b) if and only if T |= C v A, we obtain a positive subsumption counterexample. Our algorithm for computing g is based on two operations: minimization, computed by Algorithm 3 Computing a tree shaped ABox 1: function F IND T REE( A) 2: M INIMIZE AB OX( A) 3: while there is a cycle c in A do 4: Unfold a ∈ Ind(A) in cycle c 5: M INIMIZE AB OX( A) 6: Let C be the concept expression corresponding to A with counterexample A(a). 7: return (C(a),A(a)) Algorithm 2, and unfolding. Algorithm 2 minimizes a given ABox with the following rules. (Concept saturate A with H) If A(a) ∈ / A and (H, A) |= A(a) then replace A by A ∪ {A(a)}, where A ∈ NC ∩ ΣT and a ∈ Ind(A). (Domain Minimize A with A(a)) If A(a) is a counterexample and (T , A−b ) |= A(a) then replace A by A−b , where A−b is the result of removing from A all ABox assertions in which b occurs. (Role Minimize A with A(a)) If A(a) is a counterexample and (T , A−r(b,c) ) |= A(a) then replace A by A−r(b,c) , where A−r(b,c) be obtained by removing a role assertion r(b, c) from A. Lemma 5. Given a positive counterexample (A, D(a)) with D ∈ NC , Algorithm 2 computes in polynomially many steps with respect to |A|, |H|, and |T | an ABox A0 such that |Ind(A0 )| ≤ |T | and (A0 , A(b)) is a positive counterexample, for some A ∈ NC and b ∈ Ind(A0 ). It remains to show that A can be made tree-shaped. We say that A has an (undirected) cycle if there is a finite sequence a0 · r1 · a1 · ... · rk · ak such that (i) a0 = ak and (ii) there are mutually distinct assertions of the form ri+1 (ai , ai+1 ) or ri+1 (ai+1 , ai ) in A, for 0 ≤ i < k. The unfolding of a cycle c = a0 ·r1 ·a1 ·...·rk ·ak in a given ABox A is obtained by replacing c by the cycle c0 = a0 ·r1 ·a1 ·...·rk ·ak−1 ·rk ·b a0 ·r1 · · · b ak−1 ·rk ·a0 , where ai are fresh individual names, 0 ≤ i ≤ k −1, in such a way that (i) if r(ai , d) ∈ A, for an b individual d not in the cycle, then r(b ai , d) ∈ A; and (ii) if A(ai ) ∈ A then A(b ai ) ∈ A. We prove in the full version that after every unfolding-minimisation step in Algo- rithm 3 the ABox A on the one hand becomes strictly larger and on the other does not exceed the size of the target TBox T . Thus Algorithm 3 terminates after a polynomial number of steps yielding a tree-shaped counterexample. Lemma 6. Algorithm 3 computes a minimal tree shaped ABox A with size polynomial in |T | and runs in polynomially many steps in |T | and |A|. Using Theorem 2 and Theorem 1 we obtain: Theorem 4. The ELlhs data retrieval framework is polynomially exact learnable. 4 Limits of Polynomial Time Learnability Our proof of non-polynomial learnability of general EL TBoxes from data retrieval examples extends previous results on non-polynomial learnability of EL TBoxes from subsumptions [12]. We start by giving a brief overview of the construction in [12], show that it fails in the data retrieval setting and then demonstrate how it can be modified. The non-learnability proof in [12] proceeds as follows. A learner tries to exactly identify one of the possible target TBoxes {TL | L ∈ Ln }, for a superpolynomial in n set Ln defined below. At every stage of computation the oracle maintains a set of TBoxes S, which the learner is unable to distinguish based on the answers given so far. Initially S = {TL | L ∈ Ln }. It has been proved that for any EL inclusion C v D either TL |= C v D for every L ∈ Ln or the number of L ∈ Ln such that TL |= C v D does not exceed |C|. When a polynomial learner asks a membership query C v D the oracle answers ‘yes’ if TL |= C v D for every L ∈ Ln and ‘no’ otherwise. In the latter case the oracle removes polynomially many TL such that TL |= C v D from S. Similarly, for any equivalence query with hypothesis H asked by a polynomial learning algorithm there exists a polynomial size inclusion C v D, which can be returned as a counterexample and that excludes only polynomially many TBoxes from S. Thus, every query to the oracle reduces the size of S at most polynomially in n, but then the learner cannot distinguish between the remaining TBoxes of our initial superpolynomial set S. The set of indices Ln and the target TBoxes TL are defined as follows. Fix two role names r and s. An n-tuple L is a sequence of role sequences (σ1 , . . . , σn ), where every σi is a sequence of role names r and s, that is σi = σi1 σi2 . . . σin with σij ∈ {r, s}. Then Ln is a set of n-tuples such that for every L, L0 ∈ Ln with L = (σ1 , . . . , σn ), L0 = (σ10 , . . . , σn0 ), if σi = σj0 then L = L0 and i = j. There are N = b2n /nc different tuples in Ln . For every n > 0 and every n-tuple L = (σ1 , . . . , σn ) we define an acyclic EL TBox TL as the union of T0 = {Xi v ∃r.Xi+1 u ∃s.Xi+1 | 0 ≤ i < n} and the following inclusions: A1 v ∃σ1 .M u X0 An v ∃σn .M u X0 ... B1 v ∃σ1 .M u X0 Bn v ∃σn .M u X0 A ≡ X0 u ∃σ1 .M u · · · u ∃σn .M. where the expression ∃σ.C stands for ∃σ 1 .∃σ 2 . . . ∃σ n .C, M is a concept name that ‘marks’ a designated path given by σ and T0 generates a full binary tree whose edges are labelled with the role names r and s and with X0 at the root, X1 at level 1 and so on. In contrast to the subsumption framework, every TL can be exactly identified using data retrieval queries. For example, as X0 u ∃σ1 .M u · · · u ∃σn .M v A ∈ TL , a learning from data retrieval queries algorithm can learn all the sequences in the n- tuple L = (σ1 , . . . , σn ), by defining an ABox A = {X0 (a1 ), r(a1 , a2 ), s(a1 , a2 ), . . . , r(an−1 , an ), s(an−1 , an ), M (an )} and then proceeding with unfolding and minimizing A via membership queries of the form (TL , A) |= A(a1 ). To show the non-tractability for data retrieval queries, we first modify S in such a way that the concept expression which ‘marks’ the sequences in L = (σ1 , . . . , σn ) is now given by the set Bn of all conjunctions F1 u · · · u Fn , where Fi ∈ {Ei , Ēi }, for 1 ≤ i ≤ n. Intuitively, every member of Bn encodes a binary string of length n with Ei encoding 1 and Ēi encoding 0. For every L ∈ Ln and every B ∈ Bn we define TLB as the union of T0 and the concept inclusions defined above with B replacing M . Then for any sequence σ of length n there exists at most one L ∈ Ln , at most one 1 ≤ i ≤ n and at most one B ∈ Bn such that TLB |= Ai v ∃σ.B and TLB |= Bi v ∃σ.B. Notice that the size of each TLB is polynomial in n and so Ln contains B dn of each TL , with L ∈ Ln and B ∈ Bn . superpolynomially many n-tuples in the size B Every TL entails, among other inclusions, i=1 Ci v A, where every Ci is either Ai or Bi . Let Σn be the signature of the TBoxes TLB and consider a TBox T ∗ defined as the following set of concept inclusions: ∃r.(E1 u Ē1 ) v (E1 u Ē1 ), (E1 u Ē1 ) v ∃r.(E1 u Ē1 ), ∃s.(E1 u Ē1 ) v (E1 u Ē1 ), (E1 u Ē1 ) v ∃s.(E1 u Ē1 ), (Ei u Ēi ) v A for every 1 ≤ i ≤ n and every A ∈ Σn ∩ NC The basic idea of extending our TBoxes with T ∗ is that if a ∈ (Ei u Ēi )IA , for an ABox A and individual a ∈ Ind(A), then for all L ∈ Ln and B ∈ Bn , we have (TLB , A) |= D(b), where D is any EL concept expression over Σn and b ∈ Ind(A) is any successor or predecessor of a (or a itself). This means that for each individual in A at most one B of the 2n binary strings in Bn can be distinguished by data retrieval queries. The following lemma enables us to respond to membership queries without eliminating too many L ∈ Ln and B ∈ Bn used to encode TLB in the set of TBoxes that the learner cannot distinguish. Lemma 7. For any ABox A, any EL concept assertion D(a) over Σn , and any a ∈ Ind(A), if there is L ∈ Ln and B ∈ Bn such that (TLB ∪ T ∗ , A) |= D(a) then: – either (TLB ∪ T ∗ , A) |= D(a), for every L ∈ Ln and B ∈ Bn , or – (TLB ∪ T ∗ , A) |= D(a) for at most |D| elements L ∈ Ln , or – (TLB ∪ T ∗ , A) |= D(a) for at most |A| elements B ∈ Bn . The next lemma is immediate from Lemma 15 presented in [12]. It shows how the oracle can answer equivalence queries eliminating at most one L ∈ Ln used to encode TLB in the set S of TBoxes that the learner cannot distinguish. Lemma 8. For any n > 1 and any EL TBox H in Σn with |H| < 2n , there exists an ABox A, an individual a ∈ Ind(A) and an EL concept expression D over Σn such that (i) the size of A plus the size of D does not exceed 6n and (ii) if (H, A) |= D(a) then (TLB , A) |= D(a) for at most one L ∈ Ln and if (H, A) 6|= D(a) then for every L ∈ Ln we have (TLB ∪ T ∗ , A) |= D(a). Then, by Lemmas 7 and 8, we have that: (i) any polynomial size membership query can distinguish at most polynomially many TBoxes from S; and (ii) if the learner’s hypothesis is polynomial size then there exists a polynomial size counterexample that the oracle can give which distinguishes at most polynomially many TBoxes from S. Theorem 5. The EL data retrieval framework is not polynomially exact learnable. 5 Future Work We plan to consider an extension of the learning protocol in which arbitrary conjunctive queries are admitted in queries to the domain expert/oracle. We then still have polynomial time learnability for ELlhs but conjecture non-polynomial time learnability for DL-Lite∃R . Another extension is exact learnability for the Horn-extension of DL-Lite∃R for which we conjecture that polynomial time learnability still holds. Bibliography [1] D. Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1987. [2] F. Baader, S. Brandt, and C. Lutz. Pushing the EL envelope. In IJCAI, pages 364–369. Professional Book Center, 2005. [3] F. Baader, D. Calvanese, D. McGuiness, D. Nardi, and P. Patel-Schneider. The De- scription Logic Handbook: Theory, implementation and applications. Cambridge University Press, 2003. [4] S. Bechhofer, I. Horrocks, C. Goble, and R. Stevens. Oiled: a reason-able ontology editor for the semantic web. In KI 2001: Advances in Artificial Intelligence, pages 396–408. Springer, 2001. [5] D. Borchmann and F. Distel. Mining of EL-GCIs. In The 11th IEEE International Conference on Data Mining Workshops, Vancouver, Canada, 11 December 2011. IEEE Computer Society. [6] P. Buitelaar, P. Cimiano, and B. Magnini, editors. Ontology Learning from Text: Methods, Evaluation and Applications. IOS Press, 2005. [7] D. Calvanese, G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tractable reasoning and efficient query answering in description logics: The DL-Lite family. Journal of Automated reasoning, 39(3):385–429, 2007. [8] P. Cimiano, A. Hotho, and S. Staab. Learning concept hierarchies from text corpora using formal concept analysis. J. Artif. Intell. Res. (JAIR), 24:305–339, 2005. [9] J. Day-Richter, M. A. Harris, M. Haendel, S. Lewis, et al. Obo-edit an ontology editor for biologists. Bioinformatics, 23(16):2198–2200, 2007. [10] M. Frazier and L. Pitt. Learning From Entailment: An Application to Propositional Horn Sentences. In ICML, pages 120–127, 1993. [11] B. Konev, M. Ludwig, D. Walther, and F. Wolter. The logical difference for the lightweight description logic EL. J. Artif. Intell. Res. (JAIR), 44:633–708, 2012. [12] B. Konev, C. Lutz, A. Ozaki, and F. Wolter. Exact learning of lightweight descrip- tion logic ontologies. In Principles of Knowledge Representation and Reasoning: Proceedings of the Fourteenth International Conference, KR 2014, Vienna, Austria, July 20-24, 2014, 2014. [13] C. Lutz, R. Piro, and F. Wolter. Description logic TBoxes: Model-theoretic charac- terizations and rewritability. In IJCAI, pages 983–988, 2011. [14] Y. Ma and F. Distel. Learning formal definitions for snomed CT from text. In Artificial Intelligence in Medicine - 14th Conference on Artificial Intelligence in Medicine, AIME 2013, Murcia, Spain, May 29 - June 1, 2013. Proceedings, pages 73–77, 2013. [15] M. A. Musen. Protégé ontology editor. Encyclopedia of Systems Biology, pages 1763–1765, 2013. [16] C. Reddy and P. Tadepalli. Learning First-Order Acyclic Horn Programs from Entailment. In in Proceedings of the 15th International Conference on Machine Learning; (and Proceedings of the 8th International Conference on Inductive Logic Programming, pages 23–37. Morgan Kaufmann, 1998. [17] H. Stuckenschmidt, C. Parent, and S. Spaccapietra, editors. Modular Ontologies: Concepts, Theories and Techniques for Knowledge Modularization, volume 5445 of Lecture Notes in Computer Science. Springer, 2009. [18] H. Wang, M. Horridge, A. Rector, N. Drummond, and J. Seidenberg. Debugging OWL-DL ontologies: A heuristic approach. In The Semantic Web–ISWC 2005, pages 745–757. Springer, 2005.