<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Balancing Brave and Cautious Query Strategies in Ontology Debugging</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Patrick Rodler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kostyantyn Shchekotykhin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Fleiss</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerhard Friedrich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alpen-Adria-Universität Klagenfurt Universitätsstrasse 65-67 9020 Klagenfurt</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Sequential ontology debugging is aimed at the efficient discrimination between diagnoses, i.e. sets of axioms which must be altered or deleted from the ontology to restore consistency. By querying additional information, the number of possible diagnoses can be gradually reduced. The selection of the best queries is crucial for minimizing diagnosis costs. If prior fault probabilities (FPs) are available, the best results are achieved by entropy-based query selection. Given that FPs are only weakly justified, however, this strategy bravely suggests suboptimal queries although more cautious strategies should be followed. In such a case, it is more efficient to follow a no-risk strategy which prefers queries that eliminate 50% of diagnoses independently of any FPs. However, choosing the appropriate strategy in advance is impossible because the quality of given priors cannot be assessed before additional information is queried. We propose a method which combines the advantages of both approaches. On the one hand, the method takes into account available meta information in terms of FPs and the user's confidence in these. On the other hand, the method can cope with weakly justified FPs by limiting the risk of suboptimal query selections based on the user's confidence in the FPs. The readiness to take risk is adapted depending on the outcome of previous queries. Our comprehensive evaluation shows that the proposed debugging method significantly reduces the number of queries compared to both the entropy-based and the no-risk strategy for any choice of FPs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Support of ontology development and maintenance is an important requirement for the
extensive use of Semantic Web technologies. However, the correct formulation of
logical descriptions is an error-prone task even for experienced knowledge engineers.
Ontology debuggers [8, 3, 1] assist ontology development by identifying sets of axioms
(called diagnoses) that have to be modified s.t. inconsistencies or unwanted entailments
are avoided. In many cases the main problem of debugging is the large number of
alternative diagnoses.</p>
      <p>The research project is funded by grants of the Austrian Science Fund (Project V-Know,
contract 19996).</p>
      <p>To solve the problem, a set of heuristics to rank diagnoses was proposed by [4].
However, this solution is not satisfactory, as it cannot be guaranteed that the top-ranked
diagnosis is the target diagnosis, and no further assistance for diagnosis discrimination is
provided. Therefore, a debugging method based on active learning was proposed by [9].
The approach exploits the fact that an ontology without the axioms of one diagnosis may
have different entailments than the ontology without the axioms of another diagnosis.
Using these entailments the algorithm queries the user or another oracle whether a set
of axioms is entailed by the target ontology or not. The answers are used to eliminate
disagreeing diagnoses. The algorithm continues to query the oracle until a diagnosis
(the target diagnosis) is significantly more probable than any other. To minimize the
number of queries the algorithm uses some meta information, namely the fact that some
errors are more common than others [6], i.e. that different language constructs have
different fault probabilities. These probabilities can either be extracted from the logs
of previous sessions or specified by the user expressing their own beliefs. This advantage
might be lost if unsuitable probabilities are provided, i.e. ones under which the target diagnosis is rather
improbable. In this case the entropy-based query selection as well as active learning
methods might require significantly more queries than strategies like split-in-half which
behave independently of any meta information.</p>
      <p>We present a hybrid method that outperforms both types of strategies by combining
their benefits while overcoming their drawbacks. On the one hand, our method takes
advantage of the given meta information as long as good performance is achieved. On the
other hand, it gradually gets more independent of meta information if suboptimal
behavior is measured. Moreover, our strategy takes into account a user’s subjective quality
estimation of the meta information. In this way a user may decide to influence
the algorithm’s behavior or let the algorithm act freely and find the best strategy on
its own. To find the best strategy, our method constantly improves the quality of the meta
information and learns to act profitably in all kinds of situations by exploiting
its adaptiveness. In our comprehensive evaluation we demonstrate for various types and
qualities of meta information that our approach is very robust and that query costs in
comparison to existing solutions are substantially reduced in all situations.</p>
      <p>The motivation of our work as well as basic concepts are provided in Section 2.
Section 3 gives the details of the suggested approach. Implementation details are discussed
in Section 4 and evaluation results are described in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Basic Concepts and Motivation</title>
      <sec id="sec-2-1">
        <title>Ontology Debugging deals with the following problem:</title>
        <p>Problem Definition 1 (Ontology Debugging). Given a diagnosis problem instance
⟨O, B, TC⁺, TC⁻⟩ where O = T ∪ A denotes an ontology, i.e. a set of terminological
axioms T and a set of assertional axioms A, B denotes a set of background knowledge
axioms (which are assumed to be true), TC⁺ is a set of positive test cases, and TC⁻
is a set of negative test cases, such that
– O is inconsistent/incoherent¹
and/or
– O ∪ B ⊭ t⁺ for some t⁺ ∈ TC⁺
and/or
– O ∪ B ⊨ t⁻ for some t⁻ ∈ TC⁻.</p>
        <p>¹ Note that throughout the rest of this paper we consider debugging of inconsistent ontologies.
However, the same approach can also be used to restore coherency.</p>
        <p>The reasonable assumption is made that the background theory is consistent with the
positive and negative test cases, i.e. B ∪ {⋀_{t⁺ ∈ TC⁺} t⁺} ∪ {¬t⁻} is consistent for all
t⁻ ∈ TC⁻. The task is then to find a diagnosis D ⊆ O s.t. there is a set of axioms EX
s.t. 1 and 2 and 3 hold for O* := (O \ D) ∪ EX:
1. O* ∪ B is consistent/coherent
2. O* ∪ B ⊨ t⁺ ∀t⁺ ∈ TC⁺
3. O* ∪ B ⊭ t⁻ ∀t⁻ ∈ TC⁻
O* denotes the target ontology and EX is called extension. A diagnosis D is minimal
iff there is no D′ ⊂ D s.t. D′ is a diagnosis. A diagnosis D gives complete information
about the correctness of each axiom ax_k ∈ O, i.e. all ax_i ∈ D are assumed to be faulty
and all ax_j ∈ O \ D are assumed to be correct.</p>
        <p>In ontology debugging two types of operations can be distinguished: diagnosis and
repair. At first, one needs to find a set of axioms, i.e. a diagnosis D, which must be
deleted from the ontology O in order to restore consistency (requirements 1 and 3).
This is the task of ontology diagnosis. Then it might be necessary to add appropriate
axioms EX to O \ D in order to give the ontology the intended semantics/entailments
(requirement 2), because by deleting axioms from O some intended entailments might
be eliminated as well. This is the topic of ontology repair. In this work we focus on
ontology diagnosis. For EX we use the following approximation: EX := ⋀_{t⁺ ∈ TC⁺} t⁺.
Note that EX must at least contain all axioms in TC⁺.</p>
        <p>Ontology diagnosis can be exemplified by the following simple OWL DL ontology
Oex = T ∪ A with terminology T:
ax1: A ⊑ ∀s.B
ax2: B ⊑ ¬∃v.¬C
ax3: C ⊑ D ⊓ ¬D1
ax4: D ⊑ E ⊓ E1
ax5: E ⊑ F
ax6: F ⊑ R
and the assertions A = {A(w), s(w,w), v(w,w), ¬R(w)}. Let us assume that all
assertions are added to the background theory, i.e. B = A, and both sets TC⁺ and TC⁻ are
empty. Then the ontology Oex = T is inconsistent and the set of minimal diagnoses
for this diagnosis problem instance ⟨T, A, ∅, ∅⟩ is D = {D1: [ax1], D2: [ax2], D3: [ax3],
D4: [ax4], D5: [ax5], D6: [ax6]}.</p>
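The six diagnoses above can be reproduced with a small brute-force sketch. Since the example terminology forms a simple propagation chain over the single individual w, we encode each axiom as a "premise concept → conclusion concept" rule; this encoding and all names are our own simplification for illustration, not the paper's implementation or a DL reasoner:

```python
from itertools import combinations

# Each axiom of O_ex fires as a one-step rule on the individual w; the role
# assertions s(w,w) and v(w,w) in B make ax1 and ax2 applicable to w.
RULES = {
    "ax1": ("A", "B"),  # A ⊑ ∀s.B together with s(w,w)
    "ax2": ("B", "C"),  # B ⊑ ¬∃v.¬C together with v(w,w)
    "ax3": ("C", "D"),  # C ⊑ D ⊓ ¬D1
    "ax4": ("D", "E"),  # D ⊑ E ⊓ E1
    "ax5": ("E", "F"),  # E ⊑ F
    "ax6": ("F", "R"),  # F ⊑ R
}

def consistent(axioms):
    """(O \\ D) ∪ B is consistent iff R(w) is not derivable, since B = A
    contains ¬R(w)."""
    facts = {"A"}  # A(w) from the assertions
    changed = True
    while changed:
        changed = False
        for ax in axioms:
            pre, post = RULES[ax]
            if pre in facts and post not in facts:
                facts.add(post)
                changed = True
    return "R" not in facts

# A diagnosis is a set of axioms whose removal restores consistency; it is
# minimal if no proper subset is itself a diagnosis.
all_ax = set(RULES)
diagnoses = [set(d) for k in range(len(all_ax) + 1)
             for d in combinations(sorted(all_ax), k)
             if consistent(all_ax - set(d))]
minimal = [d for d in diagnoses if not any(d2 < d for d2 in diagnoses)]
print(sorted(sorted(d) for d in minimal))
```

Because the chain needs every link to derive the contradiction R(w), removing any single axiom restores consistency, so each singleton [ax_i] is a minimal diagnosis, matching the set D above.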
        <p>An algorithm for generating all possible diagnoses for a given diagnosis problem
instance is presented in [1]. However, one major issue in ontology diagnosis is that there
may be a huge number n of possible diagnoses D = {D1, …, Dn}, depending on the
size and the properties of the ontology O. This problem is also manifested in the above
example, where six possible diagnoses are found for an ontology consisting of merely
six axioms. To tackle this problem, the authors in [9] exploit the fact that the ontologies
O \ Di and O \ Dj have different entailments for Di, Dj ∈ D (Di ≠ Dj) in order to
discriminate between the potential diagnoses in D. They discuss methods which gather new
information by posing queries about intended entailments of the target ontology to an
oracle and utilize this information to reduce the set of potential diagnoses as quickly as
possible. So, the following subproblem of ontology diagnosis is addressed:</p>
      </sec>
      <sec id="sec-2-2">
        <title>Algorithm 1: Query Generation</title>
        <p>Input: diagnosis problem instance ⟨O, B, TC⁺, TC⁻⟩, set of diagnoses D, a reasoner RE instantiated to
provide all entailments of types E ∈ ET</p>
        <p>Output: a set of queries X
1 foreach DX ⊆ D do
2   X ← getEntailments(RE, B ∪ TC⁺ ∪ ⋂_{D ∈ DX} (O \ D));
3   if X ≠ ∅ then</p>
        <sec id="sec-2-2-1">
          <title>4 foreach Dr ∈ D \ DX do</title>
          <p>5   if (O \ Dr) ∪ B ∪ TC⁺ ⊨ X then DX ← DX ∪ {Dr};
6   else if (O \ Dr) ∪ B ∪ TC⁺ ⊨ ¬X then D¬X ← D¬X ∪ {Dr};
7   else D∅ ← D∅ ∪ {Dr};
8   X ← X ∪ {(X, ⟨DX, D¬X, D∅⟩)};
9 return X;</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Problem Definition 2 (Diagnosis Discrimination)</title>
          <p>Given the set of possible diagnoses D = {D1, …, Dn} w.r.t. ⟨O, B, TC⁺, TC⁻⟩, find a sequence (X1, …, Xq) with
minimal q, where Xi ∈ X for i = 1, …, q is a query to an oracle and X is the set of all
possible queries, such that the set of possible diagnoses can be reduced to D = {D*},
where D* ∈ {D1, …, Dn} is called the target diagnosis. The order of queries is crucial
and affects q.</p>
          <p>The function of the mentioned oracle to answer queries is usually realized by the user
who created the ontology or by an expert in the domain modeled by the ontology. A
set of axioms Xj is called a query if it is a subset of the common entailments of a set of
ontologies (O \ Di) where Di ∈ DjX ⊆ D. In description logics there is a number of
different entailment types ET. However, in general not all of these types are used to
construct queries, as this could produce high numbers of entailments and would make
the process of query answering very time-consuming and inefficient. In [9], for
example, queries are generated by considering only entailments obtained by classification
and realization. A query Xj means asking the oracle (O* ⊨ Xj?) and is answered
either by true, i.e. O* ⊨ Xj, or by false, i.e. O* ⊭ Xj. A query Xj partitions the set
of diagnoses D into ⟨DjX, Dj¬X, Dj∅⟩ such that:
– ∀Di ∈ DjX: (O \ Di) ∪ B ∪ {⋀_{t⁺ ∈ TC⁺} t⁺} ⊨ Xj,
– ∀Di ∈ Dj¬X: (O \ Di) ∪ B ∪ {⋀_{t⁺ ∈ TC⁺} t⁺} ⊨ ¬Xj, and
– Dj∅ = D \ (DjX ∪ Dj¬X).</p>
          <p>A query Xj is stored together with its partition as (Xj, ⟨DjX, Dj¬X, Dj∅⟩). Algorithm 1
shows a procedure for computing all queries X for a given D. The function
GETENTAILMENTS(RE, axioms) calls the reasoner RE to get all entailments X of type E ∈ ET of
axioms and returns ∅ if axioms is inconsistent. Applied to our example ontology Oex,
this algorithm returns the set of all possible queries shown in Table 1.</p>
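The partitioning step (lines 4–7 of Algorithm 1) can be sketched as follows. The `diag_entailments` mapping is a hand-coded, hypothetical stand-in for the reasoner calls, and all names are illustrative:

```python
def partition(query, diag_entailments):
    """Partition D into ⟨DX, D¬X, D∅⟩ for a candidate query (a set of
    sentences), cf. lines 4-7 of Algorithm 1. diag_entailments maps each
    diagnosis name to a pair (pos, neg): the sentences entailed resp. refuted
    by (O \\ Di) ∪ B ∪ TC⁺ (a stand-in for the reasoner calls)."""
    dx, dnx, dz = set(), set(), set()
    for d, (pos, neg) in diag_entailments.items():
        if query <= pos:        # the reduced ontology entails X
            dx.add(d)
        elif query <= neg:      # the reduced ontology entails ¬X
            dnx.add(d)
        else:                   # neither X nor ¬X is entailed
            dz.add(d)
    return dx, dnx, dz

# Hypothetical entailment data for the query X1 = {B(w)} over O_ex: removing
# ax1 refutes B(w), removing another axiom leaves B(w) entailed.
entail = {
    "D1": (set(), {"B(w)"}),
    "D2": ({"B(w)"}, set()),
    "D3": ({"B(w)"}, set()),
}
```

Here `partition({"B(w)"}, entail)` places D1 into D¬X and the remaining diagnoses into DX, mirroring the classification a reasoner would perform.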
          <p>If a query Xj is answered positively, then Xj is added to the positive test cases, i.e.
TC⁺ ∪ {Xj}, and a diagnosis Di ∈ D is a diagnosis for ⟨O, B, TC⁺ ∪ {Xj}, TC⁻⟩ iff
(O \ Di) ∪ B ∪ TC⁺ ∪ Xj ⊭ t⁻ for all t⁻ ∈ TC⁻. If a query Xj is answered negatively,
then Xj is added to the negative test cases, i.e. TC⁻ ∪ {Xj}, and a diagnosis Di ∈ D
is a diagnosis for ⟨O, B, TC⁺, TC⁻ ∪ {Xj}⟩ iff (O \ Di) ∪ B ∪ TC⁺ ⊭ Xj.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Query Selection</title>
        <p>Table 1 lists each query Xi together with its partition ⟨DiX, Di¬X, Di∅⟩.</p>
        <p>So, given that Xj = true, the set of eliminated diagnoses is Dj¬X and the set of
remaining diagnoses is DjX ∪ Dj∅. Otherwise, the set of rejected diagnoses is DjX and the set of
remaining diagnoses is Dj¬X ∪ Dj∅.</p>
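This elimination step can be written down directly; the set names follow the text and the sample partition is illustrative:

```python
def remaining_diagnoses(query_partition, answer):
    """Answer true eliminates Dj¬X (so DjX ∪ Dj∅ remain); answer false
    eliminates DjX (so Dj¬X ∪ Dj∅ remain)."""
    dx, dnx, dz = query_partition
    return (dx | dz) if answer else (dnx | dz)

# Illustrative partition ⟨DjX, Dj¬X, Dj∅⟩ of five diagnoses:
part = ({"D1", "D2"}, {"D3", "D4"}, {"D5"})
```

Note that the diagnoses in Dj∅ survive either answer, which is why queries with a large Dj∅ discriminate poorly.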
        <p>A generic diagnosis discrimination algorithm is described in Algorithm 2. The
essential part of this algorithm is the GETBESTQUERY function, whose
implementation determines its performance decisively. In [9], two different implementations of
GETBESTQUERY are studied, namely split-in-half and entropy-based query selection.</p>
        <p>The entropy-based approach uses meta information about probabilities pt that the
user makes a fault when using a syntactical construct of type t ∈ CT, where CT is the
set of construct types available in the used ontology expression language. For example,
∀, ∃, ⊑, ¬, ⊔, ⊓ are some OWL DL construct types. These fault probabilities pt are
assumed to be independent and are used to calculate fault probabilities of axioms as
p(ax_k) = 1 − ∏_{t ∈ CT} (1 − pt)^{n(t)}   (1)
where n(t) is the number of occurrences of construct type t in ax_k. These probabilities
are in turn used to determine fault probabilities of diagnoses Di ∈ D as
p(Di) = ∏_{ax_r ∈ Di} p(ax_r) · ∏_{ax_s ∈ O \ Di} (1 − p(ax_s))   (2)</p>
        <p>The strategy is then to select the query which minimizes the expected entropy in the
diagnosis probabilities after the query. This means that uncertainty should be
minimized and information gain should be maximized. According to [5] this is equivalent
to choosing the query Xj which minimizes the following scoring function:
scent(Xj) = ∑_{aj ∈ {true, false}} p(Xj = aj) log2 p(Xj = aj) + p(Dj∅) + 1   (3)
This function is minimized by queries Xj with p(DjX) = p(Dj¬X) = 0.5. So,
entropy-based query selection favors queries whose outcome is most uncertain. After each query
Xj, the new information obtained by the feedback of the oracle is incorporated in the
meta information by updating the diagnosis probabilities according to the Bayesian
formula:
p(Di | Xj = aj) = p(Xj = aj | Di) p(Di) / p(Xj = aj)   (4)
where aj ∈ {true, false},
p(Xj = true) = ∑_{Dr ∈ DjX} p(Dr) + (1/2) ∑_{Dk ∈ Dj∅} p(Dk),
and p(Xj = aj | Dk) := 1/2 for Dk ∈ Dj∅ and ∈ {0, 1} for Dr ∈ DjX ∪ Dj¬X, according
to the explanations above.</p>
        <p>The split-in-half approach, on the other hand, acts completely independently of
any meta information and selects the query Xj which minimizes the following scoring
function:
scsplit(Xj) = ||DjX| − |Dj¬X|| + |Dj∅|
So, the split-in-half strategy prefers queries which eliminate half of the diagnoses,
independently of the query outcome.</p>
        <p>The result of the evaluation in [9] is that entropy-based query selection reveals
better performance than split-in-half if the given meta information is suitable. However,
the question is how to choose the fault probabilities profitably, i.e. such that they favor the
unknown target diagnosis. By setting probabilities equally for each user, e.g. according
to the results of the study conducted by [6], it cannot be taken for granted that this meta
information will be correct for all users. On the contrary, since every user is prone to
individual and thus generally different fault patterns, we would rather need empirical
data for each individual user in order to set probabilities appropriately. But
unfortunately there is neither a log file for each user which could serve as a documentation
of common individual faults nor any extensive user study which would identify user
profiles and allow us to set probabilities more individually. Hence, an objective
approach to assigning fault probabilities for arbitrary users will generally not bear fruit.
As a consequence, one may often need to rely on a vague subjective estimation of the
fault probabilities by the user, which might not always conform to the de facto fault
probabilities. In fact, there are situations where the split-in-half approach outperforms
the entropy-based strategy, namely when the fault probabilities disfavor the target
diagnosis D*, i.e. assign low probability to D*. In this case, entropy-based query selection
can be misled by this poor meta information and might prefer queries that do not favor
the target diagnosis.</p>
        <p>To illustrate this, let us assume that the user who wants to debug our example
ontology Oex specifies the following fault probabilities: p∀ = 0.48, p∃ = 0.11, p¬ = 0.06,
p⊓ = 0.03, p⊑ = 0.03. Application of formulas (1) and (2) leads to diagnosis fault
probabilities p(D1) = 0.634, p(D2) = 0.2, p(D3) = 0.084, p(D4) = 0.04, p(D5) =
0.02, p(D6) = 0.02. Using entropy-based selection, X1, i.e. (Oex* ⊨ {B(w)}?), will
be identified as the optimal query since the difference between the probabilities of the positive
(p(X1 = true) = 0.365) and negative (p(X1 = false) = 0.634) outcome is less than
for all other queries {X2, X3, X4, X5} (see Table 1). However, note that this query
could eliminate only a single diagnosis, i.e. D1, if the unfavorable answer is given.
Hence, we call X1 a high-risk query. If, e.g., D5 is assumed to be the target diagnosis,
i.e. D* = D5, then X1 will be answered positively and all but one diagnoses are still
possible. The elimination rate of this query is therefore 1/6. The probability update
given by formula (4) then yields p(D2) = 0.316, p(D3) = 0.133, p(D4) = 0.064,
p(D5) = 0.031, p(D6) = 0.031. As the next query, X2 will be preferred, but will be
answered unfavorably as well, resulting again in the elimination of only one single
diagnosis. This procedure can be continued, i.e. by querying X3 and X4, until the
target diagnosis D5 is identified by getting the answer false to query X5. So, by
applying scent all five queries are required in order to find the target diagnosis. If queries
are selected by scsplit, on the contrary, three queries will suffice. At first, query X3 will
be chosen because it eliminates half of all diagnoses in any case. We call such a query
a no-risk query. The answer will be true and querying X5 after X4 = true will finally
reveal the target diagnosis D* = D5.</p>
        <p>This example demonstrates that the no-risk strategy scsplit (three queries) is more
suitable than scent (five queries) for fault probabilities which disfavor the target
diagnosis. Let us suppose, on the other hand, that the probabilities are assigned reasonably in
our example, e.g. that D* = D2. Then it will take the entropy-based strategy only
two queries (X1, X2) to find D*, while split-in-half will still require three queries
(X3, X2, X1). The complexity of scent in terms of required queries varies between
O(1) in the best and O(|D|) in the worst case, depending on the appropriateness of the
fault probabilities. On the other hand, scsplit always requires O(log2 |D|) queries.</p>
        <p>We learn from this example that the best choice of discrimination strategy depends
on the quality of the meta information. Unfortunately, however, it is impossible to assess
the quality of fault probabilities before the debugging session starts since this would
require us to know the target diagnosis in advance. Rather, we can exploit the additional
information gathered by querying the oracle in order to estimate the quality of
probabilities. Let us consider e.g. the first query X1 selected by scent in our example. If this
query is answered unfavorably, i.e. by true, then many more diagnoses remain than are
eliminated. Since the observed outcome was the less likely one, the partition D1X with the lower
average diagnosis probability survives. This could be a hint that the probabilities are
only weakly justified. The new strategy we present in this work incorporates exactly
this information, i.e. the elimination rate achieved by a query, when choosing the
successive query by adapting the maximum allowed ‘query-risk’. Our method combines
the advantages of both the entropy-based approach and the split-in-half approach. On
the one hand, it exploits the given meta information Info if Info is of high quality, but,
on the other hand, it quickly loses trust in Info and becomes more cautious if some
indication is given that Info is of poor quality.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Risk Optimization Strategy for Query Selection</title>
      <p>Our risk optimization algorithm is a dynamic learning procedure which on the one hand
learns by reinforcement how to select optimal queries and on the other hand
continually improves the a-priori given meta information based on new knowledge obtained
through queries to an oracle. The behavior of our algorithm can be co-determined by
the user. Therefore, the algorithm takes into account the user’s doubt about the meta
information, i.e. the fault probabilities pt for t ∈ CT, by allowing the user to define the
initial cautiousness c of the algorithm as well as a cautiousness interval [c̲, c̄] where
c, c̲, c̄ ∈ [0, ⌊|D|/2⌋/|D|] and c̲ ≤ c ≤ c̄. The interval [c̲, c̄] constitutes the set of all
admissible cautiousness values the algorithm may take during the debugging session.
High trust in the meta information is reflected by specifying a low minimum required
cautiousness c̲. If the user is unsure about the rationality of the fault probabilities, this
can be expressed by setting c̲ to a higher value. The value assigned to c̄ indicates the
maximum admissible cautiousness of the algorithm and can be interpreted as the
minimum chance a user wants to preserve of performing better than a no-risk strategy.</p>
      <p>This additional cautiousness information guides the behavior of the algorithm when
selecting queries. The relationship between cautiousness c and queries is formalized by
the following definitions:
Definition 1 (Cautiousness of a Query). We define the cautiousness caut(Xi) of a
query Xi as follows:
caut(Xi) := min(|DiX|, |Di¬X|) / |D|</p>
      <p>A query Xi is called braver than query Xj iff caut(Xi) &lt; caut(Xj). Otherwise Xi
is called more cautious than Xj. A query with the highest possible cautiousness is called
a no-risk query.</p>
      <p>Definition 2 (Elimination Rate). Given a query Xi and the corresponding answer
ai ∈ {true, false}, the elimination rate e(Xi, ai) is defined as follows:
e(Xi, ai) := |Di¬X|/|D| if ai = true, and e(Xi, ai) := |DiX|/|D| if ai = false.
The answer ai to a query Xi is called favorable iff it maximizes the elimination rate
e(Xi, ai). Otherwise ai is called unfavorable.</p>
      <p>So, the cautiousness caut(Xi) of a query Xi is exactly its minimal elimination rate,
i.e. caut(Xi) = e(Xi, ai) given that ai is the unfavorable query result. Intuitively, the
user-defined cautiousness c is the minimum percentage of diagnoses in D which should
be eliminated by the successive query. For brave queries the interval between minimum
and maximum elimination rate is larger than for cautious queries. For no-risk queries it
is minimal.</p>
      <p>Definition 3 (High-Risk Query). Given a query Xi and cautiousness c, Xi is
called a high-risk query iff caut(Xi) &lt; c, i.e. the cautiousness of the query is lower
than the cautiousness required by the user. Otherwise Xi is called a non-high-risk query.
By HR(X) ⊆ X we denote the set of all high-risk queries. The set of all queries X can
be partitioned into high-risk queries and non-high-risk queries.
Example 1. Reconsider the example ontology Oex of Section 2 and let c = 0.25. Then
query X2 has cautiousness caut(X2) = 2/6 = 0.33 ≥ 0.25 = c and is thus a non-high-risk
query. Query X1, on the contrary, is a high-risk query because caut(X1) = 1/6 =
0.17 &lt; 0.25 = c. However, if X1 is asked not as the first but, for example, as the third query,
after D3, D4, D5, D6 have been eliminated by X3 and X2 (compare with the behavior
of split-in-half for Oex in Section 2), then caut(X1) = 0.5, which means
that X1 is a non-high-risk query in this case.</p>
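Definitions 1–3 translate directly into code; the X1 and X2 partition sizes below are the ones from the running example (|D| = 6), and all function names are our own:

```python
def caut(dx, dnx, n):
    """Definition 1: caut(Xi) = min(|DiX|, |Di¬X|) / |D|."""
    return min(len(dx), len(dnx)) / n

def elimination_rate(dx, dnx, n, answer):
    """Definition 2: |Di¬X|/|D| if answered true, |DiX|/|D| otherwise."""
    return (len(dnx) if answer else len(dx)) / n

def is_high_risk(dx, dnx, n, c):
    """Definition 3: Xi is high-risk iff caut(Xi) < c."""
    return caut(dx, dnx, n) < c

# Example 1: X1 partitions D into 5 vs. 1 diagnoses, X2 into 4 vs. 2.
X1 = ({"D2", "D3", "D4", "D5", "D6"}, {"D1"})
X2 = ({"D3", "D4", "D5", "D6"}, {"D1", "D2"})
```

With c = 0.25, `is_high_risk(*X1, 6, 0.25)` holds (caut = 1/6 ≈ 0.17) while X2 is non-high-risk (caut = 2/6 ≈ 0.33), matching Example 1.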
      <p>Algorithm 3 describes our Risk Optimization Algorithm. Its core is the implementation
of the GETBESTQUERY function in Algorithm 2 which is explained in the following
(with the denotation of the respective method in brackets). Further details concerning
Algorithm 3 are provided in Section 4.
1. (GETMINSCOREQUERY) Determine the best query Xi ∈ X according to scent, that is:</p>
      <p>Xi = arg min_{Xk ∈ X} scent(Xk)</p>
      <p>2. (GETQUERYCAUTIOUSNESS) If Xi is a non-high-risk query, i.e. caut(Xi) ≥ c,
select Xi. In this case Xi is the query with maximum information gain among all
queries X and additionally guarantees the required elimination rate specified by c.
3. (GETALTERNATIVEQUERY) Otherwise, select the query Xj ∈ X (Xj ≠ Xi)
which has minimal score scent among all least cautious non-high-risk queries L,
that is:</p>
      <p>Xj = arg min_{Xk ∈ L} scent(Xk)</p>
      <p>where L = {Xr ∈ X \ HR(X) | ∀Xt ∈ X \ HR(X) : caut(Xr) ≤ caut(Xt)}.</p>
      <p>If there is no such query Xj ∈ X, then select Xi from step 1.
4. Given the query outcome Xs = as, update the meta information as per steps 4a
and 4b. If step 2 was successful, then Xs = Xi, otherwise Xs = Xj.
(a) (UPDATECAUTIOUSNESS) Update the cautiousness c depending on the
elimination rate e(Xs, as) as follows:
c ← c + cadj   (5)
where cadj denotes the cautiousness adjustment factor, which is defined as follows:
cadj := 2ε (c̄ − c̲) adj  where  adj := ⌈c |D|⌉/|D| − e(Xs, as)   (6)
and ε ∈ (0, 1/2) is an arbitrary real number. The factor 2ε (c̄ − c̲) in formula (6)
simply regulates the extent of the cautiousness adjustment depending on the
interval length c̄ − c̲. The more crucial factor in the formula is adj, which indicates
whether cadj is positive or negative, i.e. whether the cautiousness is increased
or decreased. Note that adj &lt; 0 holds for any favorable query result,
independently of the cautiousness caut(Xs) of the asked query Xs. This has the
following reason: for even |D|, the elimination rate resulting from the favorable
query outcome is at least ⌈|D|/2⌉/|D| = ⌊|D|/2⌋/|D|, which is the maximal value the minimal
elimination rate of a query can take (see Definition 1); given that |D| is odd, the elimination rate
for any favorable query result is even strictly greater than ⌊|D|/2⌋/|D|. If the value c
computed in (5) is outside the user-defined cautiousness interval [c̲, c̄], it is set
to c̲ if c &lt; c̲ and to c̄ if c &gt; c̄. Positive cadj is a penalty telling the algorithm
to get more cautious, whereas negative cadj is a bonus resulting in a braver
behavior of the algorithm.
(b) (GETPROBABILITIES) Update the diagnosis probabilities p(Dr) for Dr ∈ D
according to formula (4). Section 4 gives more details on this method.</p>
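The cautiousness update of step 4a can be sketched as follows. Note that the exact form of adj is partly an assumption: we reconstructed it so that it reproduces the worked example below (cadj = 0 for an elimination rate of 2/6 at c = 0.3 and |D| = 6), and `eps` plays the role of ε ∈ (0, 1/2):

```python
import math

def update_cautiousness(c, c_min, c_max, n_diag, elim_rate, eps=0.25):
    # Assumed reconstruction of adj (formula (6)): required minimal
    # elimination rate ⌈c|D|⌉/|D| minus the achieved elimination rate.
    adj = math.ceil(c * n_diag) / n_diag - elim_rate
    c_new = c + 2 * eps * (c_max - c_min) * adj   # c ← c + cadj, formula (5)
    return min(max(c_new, c_min), c_max)          # clip to [c̲, c̄]
```

A favorable (high) elimination rate makes adj negative and the algorithm braver; an unfavorable one increases c, making it more cautious.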
      <p>To illustrate the behavior of our algorithm, we revisit the example ontology Oex
with six diagnoses discussed in Section 2. Again, we first assume that D* = D2.
Further on, let the user be quite unsure whether or not to trust the fault
probabilities (see above) and thus define the cautiousness c := 0.3 and the cautiousness interval
[c̲, c̄] := [0, ⌊6/2⌋/6] = [0, 0.5]. Then, our algorithm will first test whether the query X1, which
has minimal score scent, is admissible w.r.t. c. Therefore, caut(X1) = 1/6 = 0.17
is calculated and compared with c = 0.3. Since caut(X1) &lt; c, X1 is denied. As a
consequence, the algorithm seeks the query with minimal scent among all least
cautious non-high-risk queries, i.e. those which guarantee the elimination of a minimal
number of ⌈0.3 · 6⌉ = 2 diagnoses. The only least cautious non-high-risk query is X2,
so it is chosen as the first query and answered by false. Therefore, the set of possible
diagnoses is reduced to D = {D1, D2}. Then, the only query allowing to discriminate
between D1 and D2, namely X1, is selected and D* = D2 is identified. For D* = D5,
the algorithm acts as follows: The first query is again X2, which is answered by true
in this case. Hence, the two diagnoses D1 and D2 are dismissed, yielding an elimination
rate e(X2, true) = 2/6. Applying formula (6) then yields cadj = 0, i.e. c remains
unchanged. Thus the next query should eliminate at least ⌈0.3 · (6 − 2)⌉ = 2 diagnoses.
The only candidate, X4 (= true), is selected, leaving open only one possible last query,
X5. So, in both situations D* = D5 (three queries) and D* = D2 (two queries) our
method performed as well as the better of scent and scsplit, respectively.</p>
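Selection steps 1–3 can be combined into one routine. The sketch below is our own rendering of GETBESTQUERY, not the paper's implementation; the probabilities and the two query partitions in the usage data are taken from the running example (X3–X5 omitted for brevity):

```python
import math

# Queries are tuples (name, DjX, Dj¬X, Dj∅); p maps diagnoses to probabilities.
def sc_ent(q, p):
    """Entropy score, formula (3)."""
    _, dx, _, dz = q
    pt = sum(p[d] for d in dx) + 0.5 * sum(p[d] for d in dz)
    s = sum(pa * math.log2(pa) for pa in (pt, 1.0 - pt) if pa > 0.0)
    return s + sum(p[d] for d in dz) + 1.0

def caut(q, n):
    """Cautiousness, Definition 1."""
    _, dx, dnx, _ = q
    return min(len(dx), len(dnx)) / n

def get_best_query(queries, p, c):
    n = len(p)
    best = min(queries, key=lambda q: sc_ent(q, p))      # step 1
    if caut(best, n) >= c:                               # step 2: non-high-risk
        return best
    non_hr = [q for q in queries if caut(q, n) >= c]     # step 3
    if not non_hr:                                       # fallback to step 1
        return best
    least = min(caut(q, n) for q in non_hr)
    return min((q for q in non_hr if caut(q, n) == least),
               key=lambda q: sc_ent(q, p))

# Data from the running example:
p = {"D1": 0.634, "D2": 0.2, "D3": 0.084, "D4": 0.04, "D5": 0.02, "D6": 0.02}
X1 = ("X1", {"D2", "D3", "D4", "D5", "D6"}, {"D1"}, set())
X2 = ("X2", {"D3", "D4", "D5", "D6"}, {"D1", "D2"}, set())
```

With c = 0.3 the scent-optimal X1 is rejected as high-risk and X2 is chosen instead, exactly as in the walk-through above; with a low required cautiousness (e.g. c = 0.1), X1 would be selected.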
    </sec>
    <sec id="sec-4">
      <title>Implementation Details</title>
      <p>Our Risk Optimization Algorithm (Algorithm 3) takes the following inputs: (1) a
diagnosis problem instance ⟨O, B, TC⁺, TC⁻⟩, (2) meta information Info = (P, C),
comprising on the one hand fault probabilities P = {pt | t ∈ CT}, where CT is the set of
syntactical construct types available in the used ontology expression language, and on
the other hand user-defined cautiousness parameters C = (c, c̲, c̄), (3) the number
n of considered leading diagnoses per iteration, and (4) a stopping criterion stop_cond.
Input 3 is needed because, in general, it is impossible to consider all minimal diagnoses
for a given diagnosis problem instance at once. A simple way of alleviating this
problem is to consider only minimal diagnoses, since the set of all diagnoses is sufficiently
described by the set of all minimal diagnoses. This holds because any superset of a
diagnosis is also a diagnosis, due to the fact that eliminating an axiom from a consistent</p>
      <sec id="sec-4-1">
        <title>Algorithm 3: Risk Optimization Algorithm</title>
        <p>ontology cannot make the ontology inconsistent. However, this does not solve the
problem as there might also be a huge number of minimal diagnoses. The problem with a
large number of diagnoses is caused by the time complexity needed to find diagnoses.
In order to explain this problem in more detail, we first provide a short review of the
diagnosis computation method described in [1] which we employ to calculate diagnoses.
This method exploits the notion of a conflict set which is defined as follows:
Definition 4 (Conflict Set). Given a diagnosis problem ⟨O, B, TC+, TC−⟩, a conflict
set CS ⊆ O is a set of axioms s.t. there is a t ∈ TC− for which CS ∪ B ∪ TC+ ∪ {¬t}
is inconsistent. A conflict set CS is called minimal iff there is no CS′ ⊂ CS such that
CS′ is a conflict set.</p>
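<p>Definition 4 can be phrased as a check with a pluggable consistency test standing in for a DL reasoner. The toy `inconsistent` predicate below (a set is inconsistent iff it contains an axiom and its negation) and the string encoding of negation are purely illustrative assumptions:</p>

```python
from itertools import combinations

def is_conflict_set(cs, background, tc_plus, tc_minus, inconsistent):
    # CS is a conflict set iff, for some negative test case t,
    # CS together with B, TC+ and the negation of t is inconsistent
    return any(inconsistent(cs | background | tc_plus | {f"not {t}"})
               for t in tc_minus)

def is_minimal_conflict_set(cs, background, tc_plus, tc_minus, inconsistent):
    # minimal iff CS is a conflict set and no proper subset of CS is one
    if not is_conflict_set(cs, background, tc_plus, tc_minus, inconsistent):
        return False
    return not any(
        is_conflict_set(set(sub), background, tc_plus, tc_minus, inconsistent)
        for r in range(len(cs))
        for sub in combinations(cs, r))

# toy stand-in for a reasoner's consistency check
def inconsistent(axioms):
    return any(f"not {a}" in axioms for a in axioms)
```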
        <p>Simply put, a conflict set CS is an inconsistent part of the ontology, i.e. CS does not
meet requirements 1 or 3 in Problem Definition 1. If a diagnosis problem instance does
not include any conflict set, then this problem instance is either solved, i.e. meets
requirements 1, 2 and 3 in Problem Definition 1, or can be easily solved by adding the
desired positive test cases TC + if requirement 2 is not met. So, the idea is to resolve
every conflict set in order to solve a given diagnosis problem instance. This is achieved
by deleting or changing at least one axiom of a minimal conflict set. The following
proposition [7] summarizes these thoughts:</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Proposition 1</title>
      <p>D ⊆ O is a (minimal) diagnosis iff D is a (minimal) hitting set of all minimal
conflict sets.</p>
      <p>In [1], the QUICKXPLAIN algorithm (in short QX) introduced by [2] is used to
compute minimal conflict sets. QX(B, A) takes as inputs a set of trusted background
knowledge axioms B and a set of axioms A and returns a minimal conflict set CS ⊆ A w.r.t.
B. If A ∪ B is consistent, QX outputs consistent. If B is inconsistent, the output is ∅.
In order to compute minimal diagnoses, a tree called HS-Tree is built with minimal
conflict sets as nodes. The path from the root node nd_0 to a node nd_i^k on level k is
denoted by HS(nd_i^k). The root of this tree is a minimal conflict set obtained by calling
QX(B ∪ TC+ ∪ {¬t}, O) for t ∈ TC− until it first outputs CS(nd_0) ≠ consistent.
For each axiom ax_i ∈ CS(nd_j^(k−1)) there is a path HS(nd_i^k) = HS(nd_j^(k−1)) ∪ {ax_i}
to a node nd_i^k. In order to compute a minimal conflict set for node nd_i^k,
QX(B ∪ TC+ ∪ {¬t}, O \ HS(nd_i^k)) is called for t ∈ TC− until an output CS(nd_i^k) ≠ consistent
is returned. If all outputs are consistent, i.e. O \ HS(nd_i^k) ∪ B ∪ TC+ ∪ {¬t} is consistent
for all t ∈ TC−, then a minimal hitting set D = HS(nd_i^k) of all minimal conflict
sets, i.e. a minimal diagnosis, is found and node nd_i^k is not further expanded.</p>
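<p>The hitting-set construction can be sketched in a few lines if the conflict sets are given explicitly (in the real algorithm they are computed lazily by QX). The breadth-first expansion and superset pruning below are a simplified stand-in for Reiter's HS-Tree, with hypothetical axiom names:</p>

```python
def minimal_hitting_sets(conflicts):
    """Breadth-first sketch of the HS-Tree: extend each path by the axioms of
    a conflict set it does not yet hit; a path hitting every conflict set is a
    diagnosis; prune supersets of already-found diagnoses."""
    results = []
    paths = [frozenset()]
    while paths:
        next_paths = []
        for path in paths:
            # first conflict set not yet hit by this path, if any
            open_cs = next((cs for cs in conflicts if not (cs & path)), None)
            if open_cs is None:
                if not any(r <= path for r in results):
                    results.append(path)  # minimal hitting set found
            else:
                for ax in open_cs:
                    next_paths.append(path | {ax})
        paths = [p for p in next_paths if not any(r <= p for r in results)]
    return results

diagnoses = minimal_hitting_sets([{"ax1", "ax2"}, {"ax2", "ax3"}, {"ax1", "ax3"}])
```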
      <p>The calculation of all minimal diagnoses at once would thus involve the complete
construction of the HS-Tree. However, in the worst case this requires an exponential (in
n_CS) number of calls to QX(B, A), each of which itself requires O(log2 |A|) calls to an
underlying reasoner, where n_CS denotes the total number of conflict sets. Also, generating all
queries (w.r.t. certain entailment types) for a number k of diagnoses quickly becomes
impracticable as k grows, since all 2^k − 2 non-trivial partitions are explored (see
Algorithm 1). Therefore it is generally not feasible to examine all minimal diagnoses at the
same time.</p>
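<p>The 2^k − 2 figure counts the candidate sets for query generation: every subset of the k leading diagnoses except the empty set and the full set. A minimal sketch (with hypothetical diagnosis labels):</p>

```python
from itertools import combinations

def nontrivial_subsets(diagnoses):
    # every candidate set of diagnoses predicting 'true' for a query:
    # all subsets except the empty set and the full set
    items = list(diagnoses)
    return [set(c) for r in range(1, len(items))
            for c in combinations(items, r)]

subsets = nontrivial_subsets(["D1", "D2", "D3", "D4"])  # 2^4 - 2 = 14 candidates
```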
      <p>
        In our approach, we tackle this issue by introducing a leading diagnoses window
of fixed size n, i.e. |D| = n. That is, when the debugging session starts, the first n
minimal diagnoses are computed as explained before. However, instead of calculating
the n minimal diagnoses by ascending cardinality as in [9], our method implements a
uniform-cost search on the HS-Tree to find the n most probable minimal diagnoses.
This uniform-cost search is guided by the meta information, i.e. the fault probabilities
P = {p_t | t ∈ CT}, provided as an input to our algorithm. From these fault
probabilities, the probabilities p(ax_j) of axioms ax_j ∈ O being erroneous can be determined
according to formula (1). These probabilities are then used to decide which node to
expand next in the search. Namely, always that node nd_i ∈ V with path HS(nd_i) from the
root node is expanded for which p(HS(nd_i)) = ∏_{ax_j ∈ HS(nd_i)} p(ax_j) is maximal
among all nodes nd_k ∈ V, where V is the set of all generated nodes in the HS-Tree. In
this manner, the most probable diagnoses are generated first by the GETDIAGNOSES
function. If g ≤ n leading diagnoses are eliminated by a query, then, in the next iteration,
the function supplies the g next most probable diagnoses, and so on. Figuratively
speaking, one could say that the leading diagnoses window moves on by e(Xi, ai) · |D| steps
after each query Xi to include the next most probable e(Xi, ai) · |D| diagnoses.
      </p>
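<p>The uniform-cost expansion can be sketched with a priority queue ordered by the path probability p(HS(nd)) = ∏ p(ax_j). This is a simplified illustration, not the authors' implementation: conflict sets are given explicitly, and it assumes all axiom probabilities are below 1, so any proper superset of a found diagnosis has a strictly smaller probability and the pruning is sound:</p>

```python
import heapq

def n_most_probable_diagnoses(p_ax, conflicts, n):
    """Expand the open path with the highest probability product first;
    return the first n complete (hence most probable) hitting sets."""
    heap = [(-1.0, frozenset())]  # negate probabilities for heapq's min-order
    found = []
    while heap and len(found) < n:
        neg_p, path = heapq.heappop(heap)
        if any(f <= path for f in found):
            continue  # superset of a known diagnosis: prune
        open_cs = next((cs for cs in conflicts if not (cs & path)), None)
        if open_cs is None:
            found.append(path)  # hitting set of all conflict sets = diagnosis
            continue
        for ax in open_cs:
            heapq.heappush(heap, (neg_p * p_ax[ax], path | {ax}))
    return found

# hypothetical axiom fault probabilities and conflict sets
window = n_most_probable_diagnoses(
    {"ax1": 0.6, "ax2": 0.3, "ax3": 0.1},
    [{"ax1", "ax2"}, {"ax2", "ax3"}],
    n=2)
```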
      <p>
        The GETPROBABILITIES function in each iteration calculates the diagnosis
probabilities for the current set of leading diagnoses D, taking into account all information
gathered by querying the oracle so far. To that end it calculates the adjusted diagnosis
probabilities p_adj(Di) for Di ∈ D as p_adj(Di) = (1/2)^z · p(Di), where z is the number
of preceding queries X_k for which Di ∈ D_k^∅ and p(Di) is the probability of Di calculated
directly from the preliminarily given fault probabilities P by formulas (1) and (2).
Afterwards the probabilities p_adj(Di) are normalized. Note that z can be computed from
TC+ and TC−, which comprise all query answers. This way of updating probabilities
is exactly in compliance with Bayes' theorem as given by formula (4). The rest of
the methods used in Algorithm 3 are explained in Section 3.
      </p>
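<p>The adjustment and normalization fit in a few lines; a minimal sketch, assuming probabilities and no-prediction counts are passed as plain dicts keyed by diagnosis:</p>

```python
def adjusted_probabilities(p, z):
    # p[d]: prior probability of diagnosis d from the fault probabilities P
    # z[d]: number of preceding queries for which d was in the
    #       no-prediction set D^∅ (each such query halves d's probability)
    raw = {d: (0.5 ** z[d]) * p[d] for d in p}
    total = sum(raw.values())
    return {d: v / total for d, v in raw.items()}

# D1 made no prediction for one past query, D2 predicted the answer each time
post = adjusted_probabilities({"D1": 0.5, "D2": 0.5}, {"D1": 1, "D2": 0})
```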
    </sec>
    <sec id="sec-6">
      <title>5 Evaluation</title>
      <p>We tested our Risk Optimization Algorithm (R) against the entropy-based approach (E)
and the split-in-half approach (S), which were explained in Section 2. As adaptiveness
is one of the central achievements of our approach, the main goal of this evaluation is
to show its robustness in a variety of possible application scenarios. A major obstacle,
however, is that only a small number of inconsistent/incoherent ontologies are
available, and they capture only a few situations, like ontology learning or tutorial cases. So,
we decided to simulate diagnosis sessions covering a broader variety of situations
using a model M which takes as inputs a set of total diagnoses, fault probabilities, and
the parameters explained throughout the rest of this section. M is realistic because the
values of parameters used in M (e.g. for query generation and diagnosis elimination)
were determined by taking averages (*) after analyzing the structure and dependencies
between diagnoses as well as between diagnoses and queries using real-world
inconsistent/incoherent ontologies as in [9]. M is fair since each approach A ∈ {E, S, R} was
tested under exactly the same conditions (i.e. diagnoses, queries, probabilities, etc.) by
means of random seeds. And M enables exhaustive automated tests.</p>
      <p>We randomly instantiated M in each test iteration, featuring different types of fault
probability distributions PT ∈ {extreme, moderate, uniform} as well as different
cases in terms of quality of probabilities QU ∈ {good, average, bad}. Each approach
A was tested for the nine different settings in PT × QU. To construct the different types of
fault probability distributions PT, it is sufficient to change the magnitude of the random
ratio r_ij = p(Di)/p(Dj) between diagnosis probabilities p(Di) &gt; p(Dj) under the
assumption that Di is a neighbor of Dj when the diagnoses are ordered descending by p(·).
This is because (i) our approach uses the fault probabilities p_t only at the very beginning
to calculate the diagnosis probabilities p(Di) and subsequently works only with the
probabilities p(Di), and because (ii) our implementation of the diagnosis generation supplies
diagnoses in order w.r.t. their probability (see Section 5). So, extreme, moderate and
uniform refer to a ‘high’, ‘medium’ and equal-to-one r_ij, respectively. Concretely, an
extreme distribution was instantiated by generating uniformly distributed random
values r_i ∈ [0, 1), one for each diagnosis Di, and weighting each r_i by w_i = (1/1.5)^i. That
is, p(Di) = r_i · w_i. As a second step, the probabilities p(Di) were normalized and the
diagnoses ordered descending by p(·). A moderate distribution was generated analogously,
but with different weights w_i = (1/1.1)^i. To obtain a uniform distribution we simply
assigned equal probabilities to all diagnoses. The quality QU of the fault probabilities is
measured in terms of the position of the target diagnosis D* in the list of all diagnoses
ordered by descending probability. By good, average and bad quality we mean that D*
is located randomly within the first, fourth and ninth tenth of this ordered list, respectively.</p>
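<p>The three distribution types can be sketched as follows; the seed handling and the choice of starting the weight index at 1 are assumptions made for the illustration, not details taken from the paper:</p>

```python
import random

def diagnosis_probabilities(n, distribution, seed=42):
    """Generate n diagnosis probabilities of the given type
    ('extreme', 'moderate' or 'uniform'), normalized and sorted descending."""
    rng = random.Random(seed)
    if distribution == "uniform":
        p = [1.0] * n
    else:
        base = 1 / 1.5 if distribution == "extreme" else 1 / 1.1
        # p(Di) = r_i * w_i with r_i uniform in [0, 1) and w_i = base^i
        p = [rng.random() * base ** i for i in range(1, n + 1)]
    total = sum(p)
    return sorted((x / total for x in p), reverse=True)
```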
      <p>Our test setting, implemented in Java2, was the following. For each A and each
precondition in PT × QU we performed 35 iterations, in each of which we instantiated
M with 200 diagnoses in total. As explained before, we then assigned probabilities to
these diagnoses according to PT and marked a diagnosis as the target D* depending
on QU. Further on, we set the number of leading diagnoses to 9, which
constitutes a trade-off between computational complexity (e.g. of query generation) and the amount
of information taken into account for query selection. The stop condition stop_cond
was reasonably defined as follows: If the currently most probable diagnosis D1 has a
probability at least three times as high as the second most probable diagnosis D2, i.e.
p(D1) ≥ 3 · p(D2), the user is asked to check whether D1 corresponds to the wanted target
diagnosis D*. If so, D* is found and the debugging session ends. Otherwise, the user</p>
      <sec id="sec-6-1">
        <title>2 See http://rmbd.googlecode.com</title>
        <p>continues the debugging session. In our implementation this ‘user check’ was simply
done by testing whether D1 = D*. Queries were then generated in conformance with
Algorithm 1, where D is the current set of leading diagnoses. In order to simulate the
non-existence of certain queries according to (*), we ran through all subsets D_i^X ⊆ D
and selected a D_i^X with probability 0.4, and for a selected D_i^X we ran through all
partitions ⟨D_i^¬X, D_i^∅⟩ of D \ D_i^X and selected ⟨D_i^¬X, D_i^∅⟩ with probability 0.1 to constitute
a query ⟨D_i^X, D_i^¬X, D_i^∅⟩, which was then added to X. The answer to a query Xi is
well-defined if D* ∈ D_i^X (true) or D* ∈ D_i^¬X (false). Otherwise, we simulated the
feedback of the oracle in that we rationally generated another diagnosis probability
distribution p_real, which can be seen as the (unknown) ‘real’ fault distribution which
applies for a user. For those queries Xi where D* ∉ (D_i^X ∪ D_i^¬X), the oracle was
implemented to give answers according to p_real. The type of the distribution p_real was
determined by QU. Hence, for the good case, p_real was generated in a way that it
correlates positively with p, i.e. p_real(Di) = r_i · p(Di) where r_i ∈ [0, 1) is a uniformly
distributed random number. For the average case, p_real was generated independently of
p, i.e. p_real(Di) = r_i. For the bad case, p_real was generated s.t. it correlates negatively
with p, i.e. p_real(Di) = r_i / p(Di). Then these probabilities were normalized.</p>
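<p>The three ways of deriving p_real from p can be sketched directly from these formulas; the function name and seed handling are illustrative assumptions:</p>

```python
import random

def simulated_real_distribution(p, quality, seed=7):
    # p: assumed fault distribution over diagnoses; quality in
    # {'good', 'average', 'bad'} controls how the hidden 'real'
    # distribution p_real relates to p
    rng = random.Random(seed)
    if quality == "good":        # positively correlated: r_i * p(Di)
        raw = {d: rng.random() * p[d] for d in p}
    elif quality == "average":   # independent of p: r_i
        raw = {d: rng.random() for d in p}
    else:                        # 'bad', negatively correlated: r_i / p(Di)
        raw = {d: rng.random() / p[d] for d in p}
    total = sum(raw.values())
    return {d: v / total for d, v in raw.items()}
```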
        <p>The relation between axioms in terms of common entailments was realized as
follows: For each query X ∈ X, we specified according to (*) that 95% of the diagnoses
currently not in the set of leading diagnoses D make no prediction about the query
outcome, i.e. are in no case eliminated by this query. All other Di ∉ D were allocated
randomly to predict either true or false.</p>
        <p>After each iteration we measured the number of queries Q# needed to find the
target diagnosis D*, the average elimination rate ER(%) observed for these Q# queries,
and the number of queries LD# still needed to find D* after it had become a
leading diagnosis, i.e. D* ∈ D. Figure 1 illustrates the overall test results. In the bar
charts, the figure shows ER(%), Q# and LD# averaged over all 35 iterations for
each approach A ∈ {E, S, R} and each precondition in PT × QU. The box plots, on
the other hand, indicate the variance of the observed number of queries Q# during
all 35 iterations by providing the minimal and maximal observed values, the first and
third quartiles and the median. The area between the 0.25 and 0.75 quantiles is marked as
a box which includes a horizontal line giving the position of the median. We tested
our method R with many different settings for the cautiousness parameters. Performance
was similar for all these settings thanks to the learning algorithm. Evaluation results are
given for (c, c̲, c̄) = (0.4, 0.3, 0.5), which yielded the most stable results. In Figure 1
the superior average performance of R in comparison to both other approaches S and
E is clearly manifested. It can be observed that R needs the fewest queries in each
situation. Responsible for this are the high ER(%) and the low LD# achieved. Indeed, our
method maximizes ER(%) while minimizing LD# in all cases compared to S and E.
As a consequence of the high ER(%), the leading diagnoses window in our method
moves on quickly to consider less probable diagnoses for discrimination. In this way,
the target diagnosis gets ‘covered’ earlier by the leading diagnoses window and is thus
identified with fewer queries. The high ER(%) can be explained by the property of our
method of constantly monitoring the elimination rate of selected queries and its ability to
react quickly to poor performance by switching to a more cautious strategy.
Accountable for the minimization of LD# is the learning of probabilities adopted from the
entropy-based approach, coupled with the high ER(%) achieved by our method. The
improvement of R w.r.t. Q# is most drastically visible in the box plots, which show that
the interval of the most frequent 50% of observed Q# values (denoted by the box) is
always located lower than or as low as for the better of S and E. In cases where the lower
bound of the interval is located as low as for another method, the length of the interval,
i.e. the variance, is lower for our method R.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6 Conclusion</title>
      <p>In this paper we presented a risk optimization approach to sequential debugging of
ontologies. The proposed method exploits meta information in terms of fault probabilities
and user-defined willingness to take risk. A substantial cost reduction compared to
existing methods was shown for any quality of the given meta information. The proposed
method is of particular significance for domains where only vague meta information is
available, such as ontology development.
</p>
      <p>[Figure 1: bar charts of ER(%), Q# and LD# and box plots of Q# for the approaches R, E and S under the preconditions in PT × QU, including Moderate Bad, Uniform Average and Uniform Bad.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shchekotykhin</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>A General Diagnosis Method for Ontologies</article-title>
          . In: Gil,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Motta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Benjamins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Musen</surname>
          </string-name>
          , M. (eds.)
          <source>Proceedings of the 4th International Semantic Web Conference (ISWC-05)</source>
          . pp.
          <fpage>232</fpage>
          -
          <lpage>246</lpage>
          . Springer (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Junker</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          : QUICKXPLAIN:
          <article-title>Preferred Explanations and Relaxations for Over-Constrained Problems</article-title>
          . In: Association for the
          <source>Advancement of Artificial Intelligence</source>
          . pp.
          <fpage>167</fpage>
          -
          <lpage>172</lpage>
          . San Jose, CA, USA (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kalyanpur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horridge</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sirin</surname>
          </string-name>
          , E.:
          <article-title>Finding all Justifications of OWL DL Entailments</article-title>
          .
          <source>In: Proc. of ISWC/ASWC2007</source>
          , Busan, South Korea.
          <source>LNCS</source>
          , vol.
          <volume>4825</volume>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>280</lpage>
          . Springer Verlag, Berlin, Heidelberg (Nov
          <year>2007</year>
          ), http://iswc2007.semanticweb.org/papers/267.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kalyanpur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sirin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuenca-Grau</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Repairing Unsatisfiable Concepts in OWL Ontologies</article-title>
          .
          <source>In: 3rd European Semantic Web Conference, ESWC 2006. Lecture Notes in Computer Science</source>
          , vol.
          <volume>4011</volume>
          , pp.
          <fpage>170</fpage>
          -
          <lpage>184</lpage>
          . Springer, Berlin, Heidelberg (
          <year>2006</year>
          ), http://www.springerlink.com/index/10.1007/11762256
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. de Kleer, J.,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          :
          <article-title>Diagnosing multiple faults</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>32</volume>
          (
          <issue>1</issue>
          ),
          <fpage>97</fpage>
          -
          <lpage>130</lpage>
          (
          <year>Apr 1987</year>
          ), http://linkinghub.elsevier.com/retrieve/pii/0004370287900634
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Rector</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drummond</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horridge</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knublauch</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stevens</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wroe</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : OWL Pizzas:
          <article-title>Practical Experience of Teaching OWL-DL: Common Errors &amp; Common Patterns</article-title>
          . In: Motta,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Shadbolt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.R.</given-names>
            ,
            <surname>Stutt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Gibbins</surname>
          </string-name>
          , N. (eds.)
          <source>Engineering Knowledge in the Age of the Semantic Web, 14th International Conference, EKAW</source>
          <year>2004</year>
          . pp.
          <fpage>63</fpage>
          -
          <lpage>81</lpage>
          . Springer, Whittenbury Hall, UK (
          <year>2004</year>
          ), http://www.springerlink.com/index/PNRG2T506C2JV3MD.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Reiter</surname>
          </string-name>
          , R.:
          <article-title>A Theory of Diagnosis from First Principles</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>32</volume>
          ,
          <fpage>57</fpage>
          -
          <lpage>95</lpage>
          (
          <year>1987</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Schlobach</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cornet</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Debugging Incoherent Terminologies</article-title>
          .
          <source>Journal of Automated Reasoning</source>
          <volume>39</volume>
          (
          <issue>3</issue>
          ),
          <fpage>317</fpage>
          -
          <lpage>349</lpage>
          (
          <year>2007</year>
          ), http://www.springerlink.com/index/10.1007/s10817-007-9076-z
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Shchekotykhin</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
          </string-name>
          , G.:
          <article-title>Query strategy for sequential ontology debugging</article-title>
          . In: Patel-Schneider,
          <string-name>
            <given-names>P.F.</given-names>
            ,
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Hitzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Mika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ,
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Horrocks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            ,
            <surname>Glimm</surname>
          </string-name>
          ,
          <string-name>
            <surname>B</surname>
          </string-name>
          . (eds.)
          <source>Proceedings of International Semantic Web Conference (ISWC</source>
          <year>2010</year>
          ). Shanghai, China (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>