Introduction

Exact Learning Description Logic Ontologies from Data Retrieval Examples

Boris Konev

Ana Ozaki

Frank Wolter

0 0 Department of Computer Science, University of Liverpool , UK

We investigate the complexity of learning description logic ontologies in Angluin et al.'s framework of exact learning via queries posed to an oracle. We consider membership queries of the form “is individual a a certain answer to a data retrieval query q in a given ABox and the unkown target TBox?” and equivalence queries of the form “is a given TBox equivalent to the unknown target TBox?”. We show that (i) DL-Lite TBoxes with role inclusions and E LI concept expressions on the right-hand side of inclusions and (ii) E L TBoxes without complex concept expressions on the right-hand side of inclusions can be learned in polynomial time. Both results are proved by a non-trivial reduction to learning from subsumption examples. We also show that arbitrary E L TBoxes cannot be learned in polynomial time.

Introduction

Building an ontology is prone to errors, time consuming, and costly. The research communities has addressed this problem in many different ways, for example, by supplying tool support for editing ontologies [ 15, 4, 9 ], developing reasoning support for debugging ontologies [18], supporting modular ontology design [17], and by investigating automated ontology generation from data or text [ 8, 6, 5, 14 ]. One major problem when building an ontology is the fact that domain experts are rarely ontology engineering experts and that, conversely, ontology engineers are typically not familiar with the domain of the ontology. An ontology building project therefore often relies on the successful communication between an ontology engineer (familiar with the semantics of ontology languages) and a domain expert (familiar with the domain of interest). In this paper, we consider a simple model of this communication process and analyse, within this model, the computational complexity of reaching a correct domain ontology. We assume that – the domain expert knows the domain ontology and its vocabulary without being able to formalize or communicate this ontology; – the domain expert is able to communicate the vocabulary of the ontology and shares it with the the ontology engineer. Thus, the domain expert and ontology engineer have a common understanding of the vocabulary of the ontology. The ontology engineer knows nothing else about the domain. – the ontology engineer can pose queries to the domain expert which the domain expert answers truthfully. Assuming that the domain expert can interpret data in her area of expertise, the main queries posed by the ontology engineer are based on instance retrieval examples: assume a data instance A and a query q(x) are given. Is the individual a a certain answer to query q(x) in A and the ontology O?

In addition, we require a way for the ontology engineer to find out whether she

has reconstructed the target ontology already and, if this is not the case, to request an example illustrating the incompleteness of the reconstruction. We abstract from defining a communication protocol for this, but assume for simplicity that the following query can be posed by the ontology engineer:

Is this ontology H complete? If not, return a data instance A, a query q(x), and an individual a such that a is a certain answer to q(x) in A and the ontology O and it is not a certain answer to q(x) in A and the ontology H.

Given this model, our question is whether the ontology engineer can learn the target ontology O and which computational resources are required for this depending on the ontology language in which the ontology O and the hypothesis ontologies H are formulated. Our model obviously abstracts from a number of fundamental problems in building ontologies and communicating about them. In particular, it makes the assumption that the domain expert knows the domain ontology and its vocabulary (without being able to formalize it) despite the fact that finding an appropriate vocabulary for a domain of interest is a major problem in ontology design [ 8 ]. We make this assumption here in order to isolate the problem of communication about the logical relationships between known vocabulary items and its dependence on the ontology language within which the relationships can be formulated.

The model described above is an instance of Angluin et al.’s framework of exact learning via queries to an oracle [ 1 ]. The queries using instance retrieval examples can be regarded as membership queries posed by a learner to an oracle and the completeness query based on a hypothesis H can be regarded as an equivalence query by the learner to the oracle. Formulated in Angluin’s terms we are thus interested in whether there exists a deterministic learning algorithm that poses membership and equivalence queries of the above form to an oracle and that learns an arbitrary ontology over a given ontology language in polynomial time. We consider polynomial learnability in three distinct DLs: we show that DL-Lite ontologies with role inclusions and arbitrary E LI concepts on the right-hand side of concept inclusions can be learned in polynomial time if database queries in instance retrieval examples are E LI instance queries (or, equivalently, acyclic conjunctive queries). We call this DL DL-Lite9 and note that it is the core of the web R ontology language profile OWL2 QL. We also note that without complex E LI concepts on the right-hand side of concept inclusions, polynomial learnability would be trivial as only finitely many non-equivalent such TBoxes exist over a given vocabulary of concept and role names. The second DL we consider is E L which is the logic underpinning the web ontology language profile OWL2 EL. We show that E L TBoxes cannot be learned in polynomial time using the protocol above if the database queries in instance retrieval examples are E L instance queries. We then consider the fragment E Llhs of E L without complex concepts on the right-hand side of concept inclusions and prove that it can be learned in polynomial time using the above protocol with instance retrieval examples.

The proofs of the positive learning results are by reduction to polynomial time learnability

results for DL-Lite9 and E Llhs for the case in which concept subsumptions rather than

R instance retrieval examples are used in the communication between the learner and the oracle [ 12 ]. Our move from concept subsumptions to data retrieval examples is motivated by the observation that domain experts are often more familiar with querying data in their domain than with the logical notion of subsumption between complex concepts. Detailed proofs are provided at http://csc.liv.ac.uk/˜frank/publ/publ.html. 2

Preliminaries

Let NC and NR be countably infinite sets of concept and role names, respectively. The dialect DL-Lite9 of DL-Lite is defined as follows [ 7 ]. A role is a role name or an inverse

R role r with r 2 NR. A role inclusion (RI) is of the form r v s, where r and s are roles. A basic concept is either a concept name or of the form 9r:>, with r a role. A

DL-Lite9 concept inclusion (CI) is of the form B v C, where B is a basic concept

R expression and C is an E LI concept expression, that is, C is formed according to the rule C; D := A j > j C u D j 9r:C j 9r :C where A ranges over NC and r ranges over NR. A DL-Lite9R TBox is a finite set of DL-Lite9R CIs and RIs. As usual, an E L concept expression is an E LI concept expression that does not use inverse roles, an E L concept inclusion has the form C v D with C and D E L concept expressions, and a (general) E L TBox is a finite set of E L concept inclusions [ 2 ]. We also consider the restriction E Llhs of general E L TBoxes where only concept names are allowed on the right-hand side of concept inclusions. The size of a concept expression C, denoted with jCj, is the length of the string that represents it, where concept names and role names are considered to be of length one. A TBox signature is the set of concept and role names occurring in the TBox. The size of a TBox T , denoted with jT j, is PCvD2T jCj + jDj.

Let NI be a countably infinite set of individual names. An ABox A is a finite non

empty set containing concept name assertions A(a) and role assertions r(a; b), where a; b are individuals in NI, A is a concept name and r is a role. Ind(A) denotes the set of individuals that occur in A. A is a singleton ABox if it contains only one ABox assertion.

Assertions of the form C(a) and r(a; b), where a; b 2 NI , C an E LI concept expression,

and r 2 NR , are called instance assertions. Note that instance assertions of the form C(a) with C not a concept name nor C = > do not occur in ABoxes. The semantics of description logic is defined as usual [ 3 ]. We write I j= to say that an inclusion or assertion is true in I. An interpretation I is a model of a KB (T ; A) if I j= for all 2 T [ A. (T ; A) j= means that I j= for all models I of (T ; A).

A learning framework F is a triple (X; L; ), where X is a set of examples (also called domain or instance space), L is a set of learning concepts1 and is a mapping from L to 2X . The subsumption learning framework FS , studied in [ 12 ], is defined as (XS ; L; S ), where L is the set of all TBoxes that are formulated in a given DL; XS is the set of subsumption examples of the form C v D, where C; D are concept expressions of the DL under consideration; and S (T ) is defined as fC v D 2 XS j T j= C v Dg, for every T 2 L. It should be clear that S (T ) = S (T 0) if, and only if, the TBoxes T and T 0 entail the same set of inclusions, that is, they are logically equivalent.

1 In the learning literature (e.g., [1]), the term ‘learning concept’ is often defined as a set of

examples. We do not distinguish between learning concepts and their representations and only consider representable learning concepts to emphasize on the task of identifying a TBox that is logically equivalent to the target TBox.

We study the data retrieval learning framework FD defined as (XD; L; D), where L is same as in FS ; X is the set of data retrieval examples of the form (A; D(a)),

where A is an ABox, D(a) is a concept assertion of the DL under consideration, and a 2 Ind(A); and (T ) = f(A; D(a)) 2 XD j (T ; A) j= D(a)g. As in the case of learning from subsumptions, S (T ) = S (T 0) if, and only if, the TBoxes T and T 0 are logically equivalent.

Given a learning framework F = (X; L; ), we are interested in the exact identification of a target learning concept l 2 L by posing queries to oracles. Let MEMl;X be the oracle that takes as input some x 2 X and returns ‘yes’ if x 2 (l) and ‘no’ otherwise. We say that x is a positive example for l if x 2 (l) and a negative example for l if x 62 (l). Then a membership query is a call to the oracle MEMl;X . Similarly, for every l 2 L, we denote by EQl;X the oracle that takes as input a hypothesis learning concept h 2 L and returns ‘yes’, if (h) = (l), or a counterexample x 2 (h) (l) otherwise, where denotes the symmetric set difference. An equivalence query is a call to the oracle EQl;X .

We say that a learning framework (X; L; ) is exact learnable if there is an algorithm A such that for any target l 2 L the algorithm A always halts and outputs l0 2 L such

that (l) = (l0) using membership and equivalence queries answered by the oracles

MEMl;X and EQl;X , respectively. A learning framework (X; L; ) is polynomially

exact learnable if it is exact learnable by an algorithm A such that at every step2 of computation the time used by A up to that step is bounded by a polynomial p(jlj; jxj), where l is the target and x 2 X is the largest counterexample seen so far3. As argued in the introduction, for learning subsumption and data retrieval learning frameworks we additionally assume that the signature of the target TBox is always known to the learner.

An important class of learning algorithms—in particular, all algorithms presented

in [ 12, 10, 16 ] fit in this class—always make equivalence queries with hypotheses h which are polynomial in the size of l and such that (h) (l), so that counterexamples returned by the EQl;X oracles are always positive. We say that such algorithms use positive bounded equivalence queries. 3

Polynomial Time Learnability In this section we prove polynomial time exact learnability of the DL-Lite9 and E Llhs

R data retrieval learning frameworks. These frameworks are instances of the general definition given above, where the concept expression D in a data retrieval example (A; D(a)) is an E LI concept expression in the DL-Lite9 framework and an E L concept R expression in the E Llhs framework, respectively.

The proof is by reduction to learning from subsumptions. We illustrate its idea for

E Llhs. To learn a TBox from data retrieval examples we run a learning from subsumptions algorithm as a ‘black box’. Every time the learning from subsumptions algorithm makes a membership or an equivalence query we rewrite the query into the data setting and pass it on to the data retrieval oracle. The oracle’s answer, rewritten back to the subsumption

2 We count each call to an oracle as one step of computation. 3 We assume some natural notion of a length of an example x and a learning concept l, denoted

jxj and jlj, respectively.

r,s A

A A A A r ... s r ... s

A A A A r ... s r ... s r

A s r

A r s

A s setting, is given to the learning from subsumptions algorithm. When the learning from subsumptions algorithm terminates we return the learnt TBox. This reduction is made possible by the close relationship between data retrieval and subsumption examples. For every TBox T and inclusions C v D, one can interpret a concept expression C as a labelled tree and encode this tree as an ABox AC with root C such that T j= C v D iff (T ; AC ) j= D( C ).

Then, membership queries in the subsumption setting can be answered with the help of a data retrieval oracle due to the relation between subsumptions and instance queries described above. An inclusion C v D is a (positive) subsumption example for some target TBox T if, and only if, (AC ; D( C )) is a (positive) data retrieval example for the same target T . To handle equivalence queries, we need to be able to rewrite data retrieval counterexamples returned by the data retrieval oracle into the subsumption setting. For every TBox T and data retrieval query (A; D(a)) one can construct a concept expression CA such that (T ; A) j= D(a) iff T j= CA v D. Such a concept expression CA can be obtained by unravelling A into a tree-shaped ABox and representing it as a concept expression. This unravelling, however, can increase the

ABox size exponentially. Thus, to obtain a polynomial bound on the running time of the

learning process, CA v D cannot be simply returned as an answer to a subsumption equivalence query. For example, for a target TBox T = f9rn:A v Bg and a hypothesis H = ; the data retrieval query (A; B(a)), where A = fr(a; a); s(a; a); A(a)g, is a positive counterexample. The tree-shaped unravelling of A up to level n is a full binary tree of depth n, as shown in Fig. 1. On the other hand, the non-equivalence of T and H can already be witnessed by (A0; B(a)), where A0 = fr(a; a); A(a)g. The unravelling of A0 up to level n produces a linear size ABox fr(a; a2); r(a2; a3); : : : ; r(an 1; an); A(a); A(a2); : : : ; A(an)g, corresponding to the left-most path in Fig. 1, which, in turn, is linear-size w.r.t. the target inclusion 9rn:A v B. Notice that A0 is obtained from A by removing the s(a; a) edge and checking, using membership queries, whether (T ; A0) j= q still holds. In other words, one might need to ask further membership queries in order to rewrite answers to data retrieval equivalence queries given by the data retrieval oracle into the subsumption setting.

We address the need of rewriting counterexamples by introducing an abstract notion

of reduction between different exact learning frameworks. To simplify notation, we assume that both learning frameworks use the same set of learning concepts L and only consider positive bounded equivalence queries. This definition of reduction can be easily extended to arbitrary learning frameworks and arbitrary queries.

We say that a learning framework F = (X; L; ) polynomially reduces to F0 = (X0; L; 0) if for some polynomials p1( ), p2( ) and p3( ; ) there exist a function f : X0 ! X and a partial function g : L L X ! X0, defined for every (l; h; x) such that jhj = p1(jlj), (h) (l) and x 2 X, for which the following conditions hold. – For all x0 2 X0 we have x0 2 0(l) if, and only if, f (x0) 2 (l); – For all x 2 X we have x 2 (l) n (h) if, and only if, g(l; h; x) 2 0(l) n 0(h); – f (x0) is computable in time p2(jx0j); – g(l; h; x) is computable in time p3(jlj; jxj) and l can only be accessed by calls to the membership oracle MEMl;X .

As in the case of learning algorithms, we consider every call to the oracle as one step

of computation. Notice also that even though g takes h as input, the polynomial time bound on computing g(l; h; x) does not depend on the size of h as g is only defined for h polynomial in the size of l.

Theorem 1. Let (X; L; ) and (X0; L; 0) be learning frameworks. If there exists a polynomial reduction from (X; L; ) to (X0; L; 0) and a polynomial learning algorithm for (X0; L; 0) that uses membership queries and positive bounded equivalence queries then (X; L; ) is polynomially exact learnable.

We use Theorem 1 to prove that DL-Lite9 and E Llhs TBoxes can be learned in

R polynomial time from data retrieval examples. We employ the following result: Theorem 2 ([ 12 ]). The DL-Lite9R and E Llhs subsumption learning frameworks are polynomial time exact learnable with membership and positive bounded equivalence queries.

As the existence of f is guaranteed by the following lemma, in what follows we prove the existence of g and establish the corresponding time bounds.

Lemma 1. Let L 2 fDL-Lite9 ; E Llhsg and let C v D be an L concept inclusion. Then

R (T ; AC ) j= D( C ) if, and only if, T j= C v D.

Polynomial Reduction for DL-Lite9R TBoxes We show for any target T and hypothesis H polynomial in the size of T that Algorithm 1 transforms every positive counterexample in polynomial time to a positive counterexample with a singleton ABox (i.e., of the form fA(a)g or fr(a; b)g). Using the equivalences (T ; fA(a)g) j= C(a) iff T j= A v C and (T ; fr(a; b)g) j= C(a) iff T j= 9r:> v C, we then obtain a positive subsumption counterexample, so g(l; h; x) is computable in polynomial time.

Given a positive data retrieval counterexample (A; C(a)), Algorithm 1 exhaustively

applies the role saturation and parent-child merging rules introduced in [ 12 ]. We say that an instance assertion C(a) is role saturated for (T ; A) if (T ; A) 6j= C0(a) whenever C0 is the result of replacing a role r by some role s 2 NR \ T with T 6j= r v s and T j= s v r, where T is the signature of the target TBox T known to the learner.

To define parent/child merging, we identify each E LI concept C with a finite tree TC

whose nodes are labeled with concept names and edges are labeled with roles in the standard way. For example, if C = 9t:(A u 9r:9r :9r:B) u 9s:> then Fig. 2a illustrates Algorithm 1 Reducing the positive counterexample

Find a singleton A0 A such that (T ; A0) j= C(a) but (H; A0) 6j= C(a) (membership queries) return (A0,C(a))

TC . Now, we say that an instance assertion C(a) is parent/child merged for T and A

if for nodes n1; n2; n3 in TC such that n2 is an r-successor of n1, n3 is an s-successor of n2 and T j= r s we have (T ; A) 6j= C0(a) if C0 is the concept that results from identifying n1 and n3. For instance, the concept in Fig. 2c is the result of identifying the leaf labeled with B in Fig. 2b with the parent of its parent.

We present a run of Algorithm 1 for T = fA v 9s:B; s v rg and H = fs v rg. Assume the oracle gives as counterexample (A; C(a)), where A = ft(a; b); A(b); s(a; c)g and C(a) = 9t:(A u 9r:9r :9r:B) u 9s:>(a) (Fig. 2a). Role saturation produces C(a) = 9t:(A u 9s:9s :9s:B) u 9s:>(a) (Fig. 2b). Then, applying parent/child merging twice we obtain C(a) = 9t:(A u 9s:B) u 9s:>(a) (Fig. 2c and 2d). (a)

r s t r A r

B (b) s s

A s t s

B (c) s t s s A

B (d) s t s

B A

Since (H; A) 6j= 9t:(A u 9s:B)(a), after Lines 4-6, Algorithm 1 updates C by choosing the conjunct 9t:(A u 9s:B). As C is of the form 9t:C0 and there is t(a; b) 2 A such that (T ; A) j= C0(b), the algorithm recursively calls the function “ReduceCounterExample” with A u 9s:B(b). Now, since (H; A) 6j= 9s:B(b), after Lines 4-6, C is updated to 9s:B. Finally, C is of the form 9t:C0 and there is no t(b; c) 2 A such that (T ; A) j= C0(c). So the algorithm proceeds to Lines 11-12, where it chooses A(b) 2 A. Since (T ; fA(b)g) j= 9s:B(b) and (H; fA(b)g) 6j= 9s:B(b) we have that T j= A v 9s:B and H 6j= A v 9s:B.

Lemma 2. Let (A; C(a)) be a positive counterexample. Then the following holds: 1. if C is a basic concept then there is a singleton A0 A such that (T ; A0) j= C(a); Algorithm 2 Minimizing an ABox A

return (A) 2. if C is of the form 9r:C0 (or 9r :C0) and C is role saturated and parent/child merged then either there is r(a; b) 2 A (or r(b; a) 2 A ) such that (T ; A) j= C0(b) or there is a singleton A0 A such that (T ; A0) j= C(a).

Lemma 3. For any target DL-Lite9 TBox T and hypothesis DL-Lite9 TBox H given

R R a positive data retrieval counterexample (A; C(a)), Algorithm 1 computes in time polynomial in jT j, jHj, jAj and jCj a counterexample C0(b) such that (T ; A0) j= C0(b), where A0 A is a singleton ABox.

Proof. (Sketch) Let (A; C(a)) be the input of “ReduceCounterExample”. The number

of membership queries in Line 3 is polynomial in jCj and jT j. If C has more than one conjunct then it is updated in Lines 4-6, so C becomes either (1) a basic concept or (2) of the form 9r:C0 (or 9r :C0). By Lemma 2 in case (1) there is a singleton A0 A such that (T ; A0) j= C(a), computed by Line 11 of Algorithm 1. In case (2) either there is a singleton A0 A such that (T ; A0) j= C(a), computed by Line 11 of

Algorithm 1, or we obtain a counterexample with a refined C. Since the size of the refined

counterexample is strictly smaller after every recursive call of “ReduceCounterExample”, the total number of calls is bounded by jCj. o

Using Theorem 2 and Theorem 1 we obtain:

Theorem 3. The DL-Lite9 data retrieval framework is polynomially exact learnable.

Polynomial Reduction for ELlhs TBoxes In this section we give a polynomial algo

rithm computing g for E Llhs. First we note that the concept assertion in data retrieval counterexamples (A; D(a)) can always be made atomic. Let T be the signature of the target TBox T .

Lemma 4. If (A; D(a)) is a positive counterexample then by posing polynomially many membership queries one can find a concept name A 2 T and an individual b 2 Ind(A) such that (A; A(b)) is also a counterexample.

Thus it suffices to show that given a positive counterexample (A; D(a)) with D 2 NC, one can compute an E L concept expression C bounded in size by jT j such that

(T ; fC(b)g) j= A(b) and (H; fC(b)g) 6j= A(b), where A 2 NC. As (T ; fC(b)g) j= A(b) if and only if T j= C v A, we obtain a positive subsumption counterexample.

Our algorithm for computing g is based on two operations: minimization, computed by

Algorithm 3 Computing a tree shaped ABox 1: function FINDTREE( A) 2: MINIMIZEABOX( A) 3: while there is a cycle c in A do 4: Unfold a 2 Ind(A) in cycle c 5: MINIMIZEABOX( A) 6: Let C be the concept expression corresponding to A with counterexample A(a). 7: return (C(a),A(a))

Algorithm 2, and unfolding. Algorithm 2 minimizes a given ABox with the following

rules.

(Concept saturate A with H) If A(a) 2= A and (H; A) j= A(a) then replace A by A [ fA(a)g, where A 2 NC \ T and a 2 Ind(A).

(Domain Minimize A with A(a)) If A(a) is a counterexample and (T ; A b) j= A(a) then replace A by A b, where A b is the result of removing from A all ABox assertions in which b occurs.

(Role Minimize A with A(a)) If A(a) is a counterexample and (T ; A r(b;c)) j= A(a) then replace A by A r(b;c), where A r(b;c) be obtained by removing a role assertion r(b; c) from A.

Lemma 5. Given a positive counterexample (A; D(a)) with D 2 NC, Algorithm 2 computes in polynomially many steps with respect to jAj, jHj, and jT j an ABox A0 such that jInd(A0)j jT j and (A0; A(b)) is a positive counterexample, for some A 2 NC and b 2 Ind(A0).

It remains to show that A can be made tree-shaped. We say that A has an (undirected) cycle if there is a finite sequence a0 r1 a1 ::: rk ak such that (i) a0 = ak and (ii) there are mutually distinct assertions of the form ri+1(ai; ai+1) or ri+1(ai+1; ai) in A, for 0 i < k. The unfolding of a cycle c = a0 r1 a1 ::: rk ak in a given ABox A is obtained by replacing c by the cycle c0 = a0 r1 a1 ::: rk ak 1 rk ba0 r1 bak 1 rk a0, where bai are fresh individual names, 0 i k 1, in such a way that (i) if r(ai; d) 2 A, for an individual d not in the cycle, then r(ai; d) 2 A; and (ii) if A(ai) 2 A then A(bai) 2 A.

We prove in the full version that after every unfolding-minimisation step in Algo

rithm 3 the ABox A on the one hand becomes strictly larger and on the other does not exceed the size of the target TBox T . Thus Algorithm 3 terminates after a polynomial number of steps yielding a tree-shaped counterexample.

Lemma 6. Algorithm 3 computes a minimal tree shaped ABox A with size polynomial in jT j and runs in polynomially many steps in jT j and jAj.

Using Theorem 2 and Theorem 1 we obtain:

Theorem 4. The E Llhs data retrieval framework is polynomially exact learnable. 4

Limits of Polynomial Time Learnability Our proof of non-polynomial learnability of general E L TBoxes from data retrieval

examples extends previous results on non-polynomial learnability of E L TBoxes from subsumptions [ 12 ]. We start by giving a brief overview of the construction in [ 12 ], show that it fails in the data retrieval setting and then demonstrate how it can be modified.

The non-learnability proof in [12] proceeds as follows. A learner tries to exactly

identify one of the possible target TBoxes fTL j L 2 Lng, for a superpolynomial in n set Ln defined below. At every stage of computation the oracle maintains a set of

TBoxes S, which the learner is unable to distinguish based on the answers given so far.

Initially S = fTL j L 2 Lng. It has been proved that for any E L inclusion C v D either TL j= C v D for every L 2 Ln or the number of L 2 Ln such that TL j= C v D does not exceed jCj. When a polynomial learner asks a membership query C v D the oracle answers ‘yes’ if TL j= C v D for every L 2 Ln and ‘no’ otherwise. In the latter case the oracle removes polynomially many TL such that TL j= C v D from S.

Similarly, for any equivalence query with hypothesis H asked by a polynomial learning

algorithm there exists a polynomial size inclusion C v D, which can be returned as a counterexample and that excludes only polynomially many TBoxes from S. Thus, every query to the oracle reduces the size of S at most polynomially in n, but then the learner cannot distinguish between the remaining TBoxes of our initial superpolynomial set S.

The set of indices Ln and the target TBoxes TL are defined as follows. Fix two

role names r and s. An n-tuple L is a sequence of role sequences ( 1; : : : ; n), where every i is a sequence of role names r and s, that is i = i1 i2 : : : in with ij 2 fr; sg. Then Ln is a set of n-tuples such that for every L; L0 2 Ln with L = ( 1; : : : ; n), L0 = ( 10; : : : ; n0), if i = j0 then L = L0 and i = j. There are N = b2n=nc different tuples in Ln. For every n > 0 and every n-tuple L = ( 1; : : : ; n) we define an acyclic E L TBox TL as the union of T0 = fXi v 9r:Xi+1 u 9s:Xi+1 j 0 i < ng and the following inclusions:

A1 v 9 1:M u X0 B1 v 9 1:M u X0 : : :

An v 9 n:M u X0

Bn v 9 n:M u X0 A

X0 u 9 1:M u u 9 n:M: where the expression 9 :C stands for 9 1:9 2 : : : 9 n:C, M is a concept name that ‘marks’ a designated path given by and T0 generates a full binary tree whose edges are labelled with the role names r and s and with X0 at the root, X1 at level 1 and so on.

In contrast to the subsumption framework, every TL can be exactly identified using

data retrieval queries. For example, as X0 u 9 1:M u u 9 n:M v A 2 TL, a learning from data retrieval queries algorithm can learn all the sequences in the ntuple L = ( 1; : : : ; n), by defining an ABox A = fX0(a1); r(a1; a2); s(a1; a2); : : : ; r(an 1; an); s(an 1; an); M (an)g and then proceeding with unfolding and minimizing A via membership queries of the form (TL; A) j= A(a1).

To show the non-tractability for data retrieval queries, we first modify S in such a

way that the concept expression which ‘marks’ the sequences in L = ( 1; : : : ; n) is now given by the set Bn of all conjunctions F1 u u Fn, where Fi 2 fEi; Eig, for 1 i n. Intuitively, every member of Bn encodes a binary string of length n with Ei encoding 1 and Ei encoding 0. For every L 2 Ln and every B 2 Bn we define TLB as the union of T0 and the concept inclusions defined above with B replacing M .

Then for any sequence of length n there exists at most one L 2 Ln, at most

one 1 i n and at most one B 2 Bn such that TLB j= Ai v 9 :B and TLB j= Bi v 9 :B. Notice that the size of each TLB is polynomial in n and so Ln contains superpolynomially many n-tuples in the size of each TLB, with L 2 Ln and B 2 Bn. Every TLB entails, among other inclusions, dn i=1 Ci v A, where every Ci is either Ai or Bi. Let n be the signature of the TBoxes TLB and consider a TBox T defined as the following set of concept inclusions: 9r:(E1 u E1) v (E1 u E1); (E1 u E1) v 9r:(E1 u E1); 9s:(E1 u E1) v (E1 u E1); (E1 u E1) v 9s:(E1 u E1); (Ei u Ei) v A for every 1 i n and every A 2 n \ NC

The basic idea of extending our TBoxes with T is that if a 2 (Ei u Ei)IA , for

an ABox A and individual a 2 Ind(A), then for all L 2 Ln and B 2 Bn, we have (TLB; A) j= D(b), where D is any E L concept expression over n and b 2 Ind(A) is any successor or predecessor of a (or a itself). This means that for each individual in A at most one B of the 2n binary strings in Bn can be distinguished by data retrieval queries. The following lemma enables us to respond to membership queries without eliminating too many L 2 Ln and B 2 Bn used to encode TLB in the set of TBoxes that the learner cannot distinguish.

Lemma 7. For any ABox A, any E L concept assertion D(a) over n, and any a 2 B Ind(A), if there is L 2 Ln and B 2 Bn such that (TL [ T ; A) j= D(a) then: – either (TLB [ T ; A) j= D(a), for every L 2 Ln and B 2 Bn, or

B – (TL [ T ; A) j= D(a) for at most jDj elements L 2 Ln, or

B – (TL [ T ; A) j= D(a) for at most jAj elements B 2 Bn.

The next lemma is immediate from Lemma 15 presented in [12]. It shows how the

oracle can answer equivalence queries eliminating at most one L 2 Ln used to encode TLB in the set S of TBoxes that the learner cannot distinguish.

Lemma 8. For any n > 1 and any E L TBox H in n with jHj < 2n, there exists an ABox A, an individual a 2 Ind(A) and an E L concept expression D over n such that (i) the size of A plus the size of D does not exceed 6n and (ii) if (H; A) j= D(a) then (TLB; A) j= D(a) for at most one L 2 Ln and if (H; A) 6j= D(a) then for every L 2 Ln

B we have (TL [ T ; A) j= D(a).

Then, by Lemmas 7 and 8, we have that: (i) any polynomial size membership query

can distinguish at most polynomially many TBoxes from S; and (ii) if the learner’s hypothesis is polynomial size then there exists a polynomial size counterexample that the oracle can give which distinguishes at most polynomially many TBoxes from S. Theorem 5. The E L data retrieval framework is not polynomially exact learnable. 5

Future Work We plan to consider an extension of the learning protocol in which arbitrary conjunctive

queries are admitted in queries to the domain expert/oracle. We then still have polynomial time learnability for E Llhs but conjecture non-polynomial time learnability for DL-Lite9 . R

Another extension is exact learnability for the Horn-extension of DL-Lite9 for which

we conjecture that polynomial time learnability still holds. R [17] H. Stuckenschmidt, C. Parent, and S. Spaccapietra, editors. Modular Ontologies: Concepts, Theories and Techniques for Knowledge Modularization, volume 5445 of Lecture Notes in Computer Science. Springer, 2009. [18] H. Wang, M. Horridge, A. Rector, N. Drummond, and J. Seidenberg. Debugging

OWL-DL ontologies: A heuristic approach. In The Semantic Web–ISWC 2005,

pages 745–757. Springer, 2005.

[1]

Angluin . Queries and concept learning . Machine Learning , 2 ( 4 ): 319 - 342 , 1987 .

[2]

Baader ,

Brandt , and

Lutz . Pushing the E L envelope . In IJCAI , pages 364 - 369 . Professional Book Center, 2005 .

[3]

Baader ,

Calvanese ,

McGuiness ,

Nardi , and

Patel-Schneider . The Description Logic Handbook: Theory, implementation and applications . Cambridge University Press, 2003 .

[4]

Bechhofer , I. Horrocks,

Goble , and

Stevens . Oiled: a reason-able ontology editor for the semantic web . In KI 2001: Advances in Artificial Intelligence , pages 396 - 408 . Springer, 2001 .

[5]

Borchmann and

Distel. Mining of E L-GCIs . In The 11th IEEE International Conference on Data Mining Workshops , Vancouver, Canada, 11 December 2011 . IEEE Computer Society.

[6]

Buitelaar ,

Cimiano , and B. Magnini, editors. Ontology Learning from Text: Methods , Evaluation and Applications . IOS Press, 2005 .

[7]

Calvanese , G. De Giacomo,

Lembo ,

Lenzerini , and

Rosati . Tractable reasoning and efficient query answering in description logics: The DL-Lite family . Journal of Automated reasoning , 39 ( 3 ): 385 - 429 , 2007 .

[8]

Cimiano ,

Hotho , and

Staab . Learning concept hierarchies from text corpora using formal concept analysis . J. Artif. Intell. Res. (JAIR) , 24 : 305 - 339 , 2005 .

[9]

Day-Richter ,

M. A.

Harris ,

Haendel ,

Lewis , et al. Obo-edit an ontology editor for biologists . Bioinformatics , 23 ( 16 ): 2198 - 2200 , 2007 .

[10]

Frazier and

Pitt . Learning From Entailment: An Application to Propositional Horn Sentences . In ICML, pages 120 - 127 , 1993 .

[11]

Konev ,

Ludwig ,

Walther , and

Wolter . The logical difference for the lightweight description logic EL . J. Artif. Intell. Res. (JAIR) , 44 : 633 - 708 , 2012 .

[12]

Konev ,

Lutz ,

Ozaki , and

Wolter . Exact learning of lightweight description logic ontologies . In Principles of Knowledge Representation and Reasoning: Proceedings of the Fourteenth International Conference, KR 2014 , Vienna, Austria, July 20-24 , 2014 , 2014 .

[13]

Lutz ,

Piro , and

Wolter . Description logic TBoxes: Model-theoretic characterizations and rewritability . In IJCAI , pages 983 - 988 , 2011 .

[14]

Ma and

Distel . Learning formal definitions for snomed CT from text . In Artificial Intelligence in Medicine - 14th Conference on Artificial Intelligence in Medicine, AIME 2013 , Murcia, Spain, May 29 - June 1, 2013 . Proceedings, pages 73 - 77 , 2013 .

[15]

M. A.

Musen . Prote´ge´ ontology editor. Encyclopedia of Systems Biology , pages 1763 - 1765 , 2013 .

[16]

Reddy and

Tadepalli . Learning First-Order Acyclic Horn Programs from Entailment . In in Proceedings of the 15th International Conference on Machine Learning; (and Proceedings of the 8th International Conference on Inductive Logic Programming , pages 23 - 37 . Morgan Kaufmann, 1998 .