<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Balancing Brave and Cautious Query Strategies in Ontology Debugging</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Patrick</forename><surname>Rodler</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Alpen-Adria-Universität Klagenfurt</orgName>
								<address>
									<addrLine>Universitätsstrasse</addrLine>
									<postCode>65-67 9020</postCode>
									<settlement>Klagenfurt</settlement>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kostyantyn</forename><surname>Shchekotykhin</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Alpen-Adria-Universität Klagenfurt</orgName>
								<address>
									<addrLine>Universitätsstrasse</addrLine>
									<postCode>65-67 9020</postCode>
									<settlement>Klagenfurt</settlement>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Philipp</forename><surname>Fleiss</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Alpen-Adria-Universität Klagenfurt</orgName>
								<address>
									<addrLine>Universitätsstrasse</addrLine>
									<postCode>65-67 9020</postCode>
									<settlement>Klagenfurt</settlement>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gerhard</forename><surname>Friedrich</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Alpen-Adria-Universität Klagenfurt</orgName>
								<address>
									<addrLine>Universitätsstrasse</addrLine>
									<postCode>65-67 9020</postCode>
									<settlement>Klagenfurt</settlement>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Balancing Brave and Cautious Query Strategies in Ontology Debugging</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">310DFDA622D0C16BF8A2F457A2A91012</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T08:57+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Sequential ontology debugging is aimed at the efficient discrimination between diagnoses, i.e. sets of axioms which must be altered or deleted from the ontology to restore consistency. By querying additional information the number of possible diagnoses can be gradually reduced. The selection of the best queries is crucial for minimizing diagnosis costs. If prior fault probabilities (FPs) are available, the best results are achieved by entropy based query selection. Given that FPs are only weakly justified, however, this strategy bravely suggests suboptimal queries although more cautious strategies should be followed. In such a case, it is more efficient to follow a no-risk strategy which prefers queries that eliminate 50% of diagnoses independently of any FPs. However, choosing the appropriate strategy in advance is impossible because the quality of given priors cannot be assessed before additional information is queried. We propose a method which combines advantages of both approaches. On the one hand, the method takes into account available meta information in terms of FPs and the user's confidence in these. On the other hand, the method can cope with weakly justified FPs by limiting the risk of suboptimal query selections based on the user's confidence in the FPs. The readiness to take risk is adapted depending on the outcome of previous queries. Our comprehensive evaluation shows that the proposed debugging method significantly reduces the number of queries compared to both the entropy based and the no-risk strategy for any choice of FPs.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Support of ontology development and maintenance is an important requirement for the extensive use of Semantic Web technologies. However, the correct formulation of logical descriptions is an error-prone task even for experienced knowledge engineers. Ontology debuggers <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b0">1]</ref> assist ontology development by identifying sets of axioms (called diagnoses) that have to be modified s.t. inconsistencies or unwanted entailments are avoided. In many cases the main problem of debugging is the large number of alternative diagnoses.</p><p>The research project is funded by grants of the Austrian Science Fund (Project V-Know, contract 19996).</p><p>To solve this problem, a set of heuristics for ranking diagnoses was proposed by <ref type="bibr" target="#b3">[4]</ref>. However, this solution is not satisfactory, as it cannot be guaranteed that the top-ranked diagnosis is the target diagnosis, and no further assistance for diagnosis discrimination is provided. Therefore, a debugging method based on active learning was proposed by <ref type="bibr" target="#b8">[9]</ref>. The approach exploits the fact that an ontology without the axioms of one diagnosis may have different entailments than the ontology without the axioms of another diagnosis. Using these entailments, the algorithm queries the user or another oracle whether a set of axioms is entailed by the target ontology or not. The answers are used to eliminate disagreeing diagnoses. The algorithm continues to query the oracle until one diagnosis (the target diagnosis) is significantly more probable than any other. To minimize the number of queries, the algorithm uses meta information, namely the fact that some errors are more common than others <ref type="bibr" target="#b5">[6]</ref>, i.e. that different language constructs have different fault probabilities. 
These probabilities can either be extracted from the logs of previous debugging sessions or specified by the user based on their own beliefs. This advantage might be lost, however, if unsuitable probabilities are provided under which the target diagnosis is rather improbable. In this case, entropy-based query selection as well as other active learning methods might require significantly more queries than strategies like split-in-half, which behave independently of any meta information.</p><p>We present a hybrid method that outperforms both types of strategies by combining their benefits while overcoming their drawbacks. On the one hand, our method takes advantage of the given meta information as long as good performance is achieved. On the other hand, it gradually becomes more independent of the meta information if suboptimal behavior is measured. Moreover, our strategy takes into account a user's subjective quality estimate of the meta information. In this way a user may decide to influence the algorithm's behavior or let the algorithm act freely and find the best strategy on its own. To find the best strategy, our method constantly improves the quality of the meta information and learns to act profitably in all kinds of situations by exploiting its adaptiveness. In our comprehensive evaluation we demonstrate for various types and qualities of meta information that our approach is very robust and that query costs are substantially reduced in comparison to existing solutions in all situations.</p><p>The motivation of our work as well as basic concepts are provided in Section 2. Section 3 gives the details of the suggested approach. Implementation details are discussed in Section 4 and evaluation results are described in Section 5.</p></div>
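The query-and-eliminate loop described above can be sketched compactly. The following Python sketch is ours, not the paper's implementation; the helper `discriminate`, the oracle callback, and the toy diagnoses/queries are all illustrative assumptions:

```python
# Illustrative sketch (not the paper's algorithm) of sequential diagnosis
# discrimination: each query comes with the diagnoses predicting the entailment
# (d_x) and those predicting its negation (d_nx); the oracle's answer removes
# the disagreeing partition.

def discriminate(diagnoses, queries, oracle):
    """Ask queries until at most one diagnosis remains."""
    remaining = set(diagnoses)
    while len(remaining) > 1 and queries:
        query, d_x, d_nx = queries.pop(0)
        if oracle(query):          # answer true: the entailment holds
            remaining -= d_nx      # diagnoses predicting the negation are eliminated
        else:                      # answer false
            remaining -= d_x
    return remaining

# Toy run with three hypothetical diagnoses and two precomputed queries.
answers = {"q1": True, "q2": False}
result = discriminate({"D1", "D2", "D3"},
                      [("q1", {"D1"}, {"D2", "D3"}), ("q2", {"D2"}, {"D3"})],
                      lambda q: answers[q])
print(result)  # {'D1'}
```

The order in which queries are posed determines how quickly the candidate set shrinks, which is exactly the selection problem the paper addresses.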
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Basic Concepts and Motivation</head><p>Ontology Debugging deals with the following problem:</p><p>Problem Definition 1 (Ontology Debugging) Given a diagnosis problem instance ⟨O, B, TC + , TC − ⟩ where O = T ∪ A denotes an ontology, i.e. a set of terminological axioms T and a set of assertional axioms A, B denotes a set of background knowledge axioms (which are assumed to be true), TC + is a set of positive test cases, and TC − is a set of negative test cases, such that O is inconsistent/incoherent <ref type="foot" target="#foot_0">1</ref> and/or</p><formula xml:id="formula_0">∃ t + ∈ TC + : O ∪ B ⊭ t + and/or ∃ t − ∈ TC − : O ∪ B |= t − .</formula><p>The reasonable assumption is made that the background theory is consistent with the positive and negative test cases, i.e. B ∪ (⋃ t + ∈TC + t + ) ∪ ¬t − is consistent for all t − ∈ TC − . The task is then to find a diagnosis D ⊆ O s.t. there is a set of axioms EX s.t. requirements 1, 2 and 3 hold for O * := (O \ D) ∪ EX: In ontology debugging two types of operations can be distinguished: diagnosis and repair. At first, one needs to find a set of axioms, i.e. a diagnosis D, which must be deleted from the ontology O in order to restore consistency (requirements 1 and 3). This is the task of ontology diagnosis. Then it might be necessary to add appropriate axioms EX to O \ D in order to give the ontology the intended semantics/entailments (requirement 2), because by deleting axioms from O some intended entailments might be eliminated as well. This is the topic of ontology repair. In this work we focus on ontology diagnosis. For EX we use the following approximation: EX ≈ ⋃ t + ∈TC + t + . Note that EX must at least contain all axioms in TC + .</p><formula xml:id="formula_1">1. O * ∪ B is consistent/coherent, 2. O * ∪ B |= t + ∀t + ∈ TC + , 3. O * ∪ B ⊭ t − ∀t − ∈ TC −</formula><p>Ontology diagnosis can be exemplified by the following simple OWL DL ontology O ex = T ∪ A with terminology T : An algorithm for generating all possible diagnoses for a given diagnosis problem instance is presented in <ref type="bibr" target="#b0">[1]</ref>. However, one major issue in ontology diagnosis is that there may be a huge number n of possible diagnoses D = {D 1 , . . . , D n }, depending on the size and the properties of the ontology O. This problem is also manifested in the above example, where six possible diagnoses are found for an ontology consisting of merely six axioms. To tackle this problem, the authors in <ref type="bibr" target="#b8">[9]</ref> exploit the fact that the ontologies O \ D i and O \ D j have different entailments for D i , D j ∈ D (D i ≠ D j ) in order to discriminate between the potential diagnoses D. They discuss methods which gather new information by posing queries about intended entailments of the target ontology to an oracle and utilize this information to reduce the set of potential diagnoses as quickly as possible. So, the following subproblem of ontology diagnosis is addressed:</p><formula xml:id="formula_2">Algorithm 1: Query Generation Input: diagnosis problem instance ⟨O, B, TC + , TC − ⟩, set of diagnoses D, a reasoner R E instantiated to provide all entailments of types E ⊆ ET Output: a set of queries X 1 foreach D X ⊂ D do 2 X ← getEntailments(R E , B ∪ TC + ∪ ⋂ D∈D X (O \ D)); 3 if X ≠ ∅ then 4 foreach Dr ∈ D \ D X do 5 if O \ Dr ∪ B ∪ TC + |= X then D X ← D X ∪ {Dr}; 6 else if O \ Dr ∪ B ∪ TC + |= ¬X then D ¬X ← D ¬X ∪ {Dr}; 7 else D ∅ ← D ∅ ∪ {Dr}; 8 X ← X ∪ {(X, D X , D ¬X , D ∅ )} 9 return X;</formula><p>Problem Definition 2 (Diagnosis Discrimination) Given the set of possible diagnoses D = {D 1 , . . . , D n } w.r.t. ⟨O, B, TC + , TC − ⟩, find a sequence (X 1 , . . . , X q ) with minimal q where X i ∈ X for i = 1, . . .
, q is a query to an oracle and X is the set of all possible queries, such that the set of possible diagnoses can be reduced to D = {D * }, where D * ∈ {D 1 , . . . , D n } is called the target diagnosis. The order of queries is crucial and affects q.</p><p>The function of the mentioned oracle to answer queries is usually realized by the user who created the ontology or by an expert in the domain modeled by the ontology. A set of axioms X j is called a query if it is a subset of the common entailments of a set of ontologies</p><formula xml:id="formula_3">(O \ D i ) where D i ∈ D X j ⊂ D.</formula><p>In description logics there are a number of different entailment types ET . However, in general not all of these types are used to construct queries, as this could produce large numbers of entailments and would make the process of query answering very time-consuming and inefficient. In <ref type="bibr" target="#b8">[9]</ref>, for example, queries are generated by considering only entailments obtained by classification and realization. A query X j means asking the oracle (O * |= X j ?) and is answered either by true, i.e. O * |= X j , or by false, i.e. O * ⊭ X j . A query X j partitions the set of diagnoses D into D X j , D ¬X j , D ∅ j such that:</p><formula xml:id="formula_4">∀ D i ∈ D X j : (O \ D i ) ∪ B ∪ (⋃ t + ∈TC + t + ) |= X j , ∀ D i ∈ D ¬X j : (O \ D i ) ∪ B ∪ (⋃ t + ∈TC + t + ) |= ¬X j , and D ∅ j = D \ (D X j ∪ D ¬X j ).</formula><p>A query X j is stored together with its partition as (X j , D X j , D ¬X j , D ∅ j ). Algorithm 1 shows a procedure for computing all queries X for a given D. The function GETENTAILMENTS(R E , axioms) calls the reasoner R E to get all entailments X of type E ∈ ET of axioms and returns ∅ if axioms is inconsistent.
Applied to our example ontology O ex , this algorithm returns the set of all possible queries shown in Table <ref type="table" target="#tab_1">1</ref>.</p><p>If a query X j is answered positively, then X j is added to the positive test cases, i.e. TC + ∪ {X j }, and a diagnosis D i ∈ D is a diagnosis for ⟨O, B, TC + ∪ {X j }, TC − ⟩ iff (O \ D i ) ∪ B ∪ TC + ∪ X j ⊭ t − for all t − ∈ TC − . If a query X j is answered negatively, then X j is added to the negative test cases, i.e. TC − ∪ {X j }, and a diagnosis D i ∈ D is a diagnosis for ⟨O, B, TC + , TC − ∪ {X j }⟩ iff (O \ D i ) ∪ B ∪ TC + ⊭ X j .</p><formula xml:id="formula_6">if getQueryAnswer(X) == true then D ← D \ D ¬X ; else D ← D \ D X ; until |D| == 1; return D;</formula><p>So, given that X j = true, the set of eliminated diagnoses is D ¬X j and the set of remaining diagnoses is D X j ∪ D ∅ j . Otherwise, the set of rejected diagnoses is D X j and the set of remaining diagnoses is D ¬X j ∪ D ∅ j . A generic diagnosis discrimination procedure is described in Algorithm 2, whose core loop is shown above. The essential part of this algorithm is the GETBESTQUERY function, whose implementation determines its performance decisively. In <ref type="bibr" target="#b8">[9]</ref>, two different implementations of GETBESTQUERY are studied, namely split-in-half and entropy-based query selection.</p><p>The entropy-based approach uses meta information about the probabilities p t that the user makes a fault when using a syntactical construct of type t ∈ CT , where CT is the set of construct types available in the used ontology expression language. For example, ∀, ∃, ¬, ⊓, and ⊔ are some OWL DL construct types. These fault probabilities p t are assumed to be independent and are used to calculate the fault probabilities of axioms as</p><formula xml:id="formula_7">p(ax k ) = 1 − ∏ t∈CT (1 − p t ) n(t)<label>(1)</label></formula><p>where n(t) is the number of occurrences of construct type t in ax k . 
These probabilities are in turn used to determine fault probabilities of diagnoses</p><formula xml:id="formula_8">D i ∈ D as p(D i ) = ∏ ax r ∈D i p(ax r ) ∏ ax s ∈O\D i (1 − p(ax s )).<label>(2)</label></formula><p>The strategy is then to select the query which minimizes the expected entropy in the diagnosis probabilities after the query. This means that uncertainty should be minimized and information gain should be maximized. According to <ref type="bibr" target="#b4">[5]</ref> this is equivalent to choosing the query X j which minimizes the following scoring function:</p><formula xml:id="formula_9">sc ent (X j ) = ∑ a j ∈{true,false} p(X j = a j ) log 2 p(X j = a j ) + p(D ∅ j ) + 1<label>(3)</label></formula><p>This function is minimized by queries X j with p(D X j ) = p(D ¬X j ) = 0.5. So, entropy-based query selection favors queries whose outcome is most uncertain. After each query X j , the new information obtained by the feedback of the oracle is incorporated in the meta information by updating the diagnosis probabilities according to the Bayesian formula:</p><formula xml:id="formula_10">p(D i |X j = a j ) = p(X j = a j |D i ) p(D i ) / p(X j = a j )<label>(4)</label></formula><p>where a j ∈ {true, false} and</p><formula xml:id="formula_11">p(X j = true) = ∑ D r ∈D X j p(D r ) + 1/2 ∑ D k ∈D ∅ j p(D k ), and p(X j = a j |D k ) := 1/2 for D k ∈ D ∅ j and in {0, 1} for D r ∈ D X j ∪ D ¬X j</formula><p>according to the explanations on page 4.</p><p>The split-in-half approach, on the other hand, acts completely independently of any meta information and selects the query X j which minimizes the following scoring function:</p><formula xml:id="formula_12">sc split (X j ) = ||D X j | − |D ¬X j || + |D ∅ j |</formula><p>So, the split-in-half strategy prefers queries which eliminate half of the diagnoses, independently of the query outcome.</p><p>The result of the evaluation in <ref type="bibr" target="#b8">[9]</ref> is that entropy-based query selection reveals better performance than 
split-in-half if the given meta information is suitable. However, the question is how to choose the fault probabilities such that they favor the unknown target diagnosis. By setting probabilities identically for all users, e.g. according to the results of the study conducted by <ref type="bibr" target="#b5">[6]</ref>, it cannot be taken for granted that this meta information will be correct for all users. On the contrary, since every user is prone to individual and thus generally different fault patterns, we would rather need empirical data for each individual user in order to set the probabilities appropriately. But unfortunately there is neither a log file for each user which could be used as a documentation of common individual faults nor any extensive user study which would identify user profiles and allow us to set probabilities more individually. Hence, an objective approach to assigning fault probabilities for arbitrary users will generally not bear fruit. As a consequence, one may often need to rely on a vague subjective estimation of the fault probabilities by the user, which might not always conform to the de facto fault probabilities. In fact, there are situations where the split-in-half approach outperforms the entropy-based strategy, namely when the fault probabilities disfavor the target diagnosis D * , i.e. assign low probability to D * . In this case, entropy-based query selection can be misled by this poor meta information and might prefer queries that do not favor the target diagnosis.</p><p>To illustrate this, let us assume that the user who wants to debug our example ontology O ex specifies the following fault probabilities: p ∀ = 0.48, p ∃ = 0.11, p ¬ = 0.06, p ⊓ = 0.03, p ⊔ = 0.03. 
Application of formulas ( <ref type="formula" target="#formula_7">1</ref>) and ( <ref type="formula" target="#formula_8">2</ref>) then yields the diagnosis fault probabilities, based on which X 1 , i.e. asking (O * ex |= {B(w)}?), will be identified as the optimal query, since the difference between the probabilities of the positive (p(X 1 = true) = 0.365) and negative (p(X 1 = false) = 0.634) outcome is smaller than for all other queries {X 2 , X 3 , X 4 , X 5 } (see Table <ref type="table" target="#tab_1">1</ref>). However, note that this query could eliminate only a single diagnosis, i.e. D 1 , if the unfavorable answer is given. Hence, we call X 1 a high-risk query. If, e.g., D 5 is assumed to be the target diagnosis, i.e. D * = D 5 , then X 1 will be answered positively and all diagnoses but one are still possible. The elimination rate of this query is therefore 1/6. The probability update given by formula (4) then yields p(D 2 ) = 0.316, p(D 3 ) = 0.133, p(D 4 ) = 0.064, p(D 5 ) = 0.031, p(D 6 ) = 0.031. As the next query, X 2 will be preferred, but will be answered unfavorably as well, resulting again in the elimination of only one single diagnosis. This procedure continues, i.e. by querying X 3 and X 4 , until the target diagnosis D 5 is identified by the answer false to query X 5 . So, by applying sc ent , all five queries are required in order to find the target diagnosis. If queries are selected by sc split , on the contrary, three queries will suffice. At first, query X 3 will be chosen because it eliminates half of all diagnoses in any case. We call such a query a no-risk query. The answer will be true, and querying X 5 after X 4 = true will finally reveal the target diagnosis D * = D 5 .</p><p>This example demonstrates that the no-risk strategy sc split (three queries) is more suitable than sc ent (five queries) for fault probabilities which disfavor the target diagnosis. Let us suppose, on the other hand, that probabilities are assigned reasonably in our example, e.g. that D * = D 2 . 
Then it will take the entropy-based strategy only two queries (X 1 , X 2 ) to find D * while split-in-half will still require three queries (X 3 , X 2 , X 1 ). The complexity of sc ent in terms of required queries varies between O(1) in the best and O(|D|) in the worst case depending on the appropriateness of the fault probabilities. On the other hand, sc split always requires O(log 2 |D|) queries. We learn from this example that the best choice of discrimination strategy depends on the quality of the meta information. Unfortunately, however, it is impossible to assess the quality of fault probabilities before the debugging session starts since this would require us to know the target diagnosis in advance. Rather, we can exploit the additional information gathered by querying the oracle in order to estimate the quality of probabilities. Let us consider e.g. the first query X 1 selected by sc ent in our example. If this query is answered unfavorably, i.e. by true, then many more diagnoses remain than are eliminated. Since the observed outcome was less likely, the partition D X 1 with lower average diagnosis probability survives. This could be a hint that the probabilities are only weakly justified. The new strategy we present in this work incorporates exactly this information, i.e. the elimination rate achieved by a query, when choosing the successive query by adapting the maximum allowed 'query-risk'. Our method combines the advantages of both the entropy-based approach and the split-in-half approach. On the one hand, it exploits the given meta information Info if Info is of high quality, but, on the other hand, it quickly loses trust in Info and becomes more cautious if some indication is given that Info is of poor quality.</p></div>
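The quantities of formulas (1)-(3) and the split-in-half score can be prototyped directly. The Python sketch below is our illustration, not the authors' implementation; the construct names and numeric inputs are assumptions, and sc_ent is written over the partition probabilities of a single query:

```python
import math

# Sketch of the fault-probability model and both query scores from Section 2
# (formulas (1)-(3) and sc_split); all inputs below are illustrative, not the
# paper's example values.

def p_axiom(construct_counts, p_construct):
    """Formula (1): p(ax) = 1 - prod over t of (1 - p_t)^n(t)."""
    ok = 1.0
    for t, n in construct_counts.items():
        ok *= (1.0 - p_construct[t]) ** n
    return 1.0 - ok

def p_diagnosis(diag, ontology, p_ax):
    """Formula (2): product of p(ax) for ax in the diagnosis and
    (1 - p(ax)) for every other axiom of the ontology."""
    p = 1.0
    for ax in ontology:
        p *= p_ax[ax] if ax in diag else (1.0 - p_ax[ax])
    return p

def sc_ent(p_dx, p_dnx, p_d0):
    """Formula (3), over the partition probabilities of one query;
    minimal (= 0) when p(true) = p(false) = 0.5 and p_d0 = 0."""
    score = p_d0 + 1.0
    for p_a in (p_dx + p_d0 / 2.0, p_dnx + p_d0 / 2.0):
        if p_a > 0.0:
            score += p_a * math.log2(p_a)
    return score

def sc_split(n_dx, n_dnx, n_d0):
    """Split-in-half score: prefers queries splitting D evenly."""
    return abs(n_dx - n_dnx) + n_d0

# A perfectly balanced query gets the best (lowest) score under both measures.
print(sc_ent(0.5, 0.5, 0.0), sc_split(3, 3, 0))  # 0.0 0
```

Selecting arg min over these scores for all candidate queries reproduces the two strategies compared in the example above.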
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Risk Optimization Strategy for Query Selection</head><p>Our risk optimization algorithm is a dynamic learning procedure which on the one hand learns by reinforcement how to select optimal queries and on the other hand continually improves the a-priori given meta information based on new knowledge obtained through queries to an oracle. The behavior of our algorithm can be co-determined by the user. Therefore, the algorithm takes into account the user's doubt about the meta information, i.e. the fault probabilities p t for t ∈ CT , by allowing them to define the initial cautiousness c of the algorithm as well as a cautiousness interval [c̲, c̄]. High trust in the meta information is reflected by specifying a low minimum required cautiousness c̲. If the user is unsure about the rationality of the fault probabilities, this can be expressed by setting c̲ to a higher value. The value assigned to c̄ indicates the maximum admissible cautiousness of the algorithm and can be interpreted as the minimum chance a user wants to preserve of performing better than a no-risk strategy.</p><p>This additional cautiousness information guides the behavior of the algorithm when selecting queries. The relationship between cautiousness c and queries is formalized by the following definitions: Definition 1 (Cautiousness of a Query). We define the cautiousness caut(X i ) of a query X i as follows:</p><formula xml:id="formula_13">caut(X i ) := min(|D X i |, |D ¬X i |) / |D| ∈ [0, ⌊|D|/2⌋ / |D|]</formula><p>A query X i is called braver than query X j iff caut(X i ) &lt; caut(X j ). Otherwise X i is called more cautious than X j . A query with the highest possible cautiousness is called a no-risk query.</p><p>Definition 2 (Elimination Rate). 
Given a query X i and the corresponding answer a i ∈ {true, false}, the elimination rate e(X i , a i ) is defined as follows:</p><formula xml:id="formula_14">e(X i , a i ) = |D ¬X i | / |D| if a i = true, and e(X i , a i ) = |D X i | / |D| if a i = false</formula><p>The answer a i to a query X i is called favorable iff it maximizes the elimination rate e(X i , a i ). Otherwise a i is called unfavorable.</p><p>So, the cautiousness caut(X i ) of a query X i is exactly the minimal elimination rate, i.e. caut(X i ) = e(X i , a i ) given that a i is the unfavorable query result. Intuitively, the user-defined cautiousness c is the minimum percentage of diagnoses in D which should be eliminated by the successive query. For brave queries the interval between minimum and maximum elimination rate is larger than for cautious queries. For no-risk queries it is minimal. Definition 3 (High-Risk Query). Given a query X i and cautiousness c, X i is called a high-risk query iff caut(X i ) &lt; c, i.e. the cautiousness of the query is lower than the cautiousness required by the user. Otherwise X i is called a non-high-risk query. By HR(X) ⊆ X we denote the set of all high-risk queries. The set of all queries X can be partitioned into high-risk queries and non-high-risk queries.</p><p>Example 1. Reconsider the example ontology O ex of Section 2 and let c = 0.25. Then query X 2 has cautiousness caut(X 2 ) = 2/6 = 0.33 ≥ 0.25 = c and is thus a non-high-risk query. Query X 1 , on the contrary, is a high-risk query because caut(X 1 ) = 1/6 = 0.17 &lt; 0.25 = c. However, if X 1 is not asked first but, for example, as the third query after D 3 , D 4 , D 5 , D 6 have been eliminated by X 3 and X 2 (compare with the behavior of split-in-half for O ex in Section 2), then the cautiousness caut(X 1 ) = 0.5, which means that X 1 is a non-high-risk query in this case. Algorithm 3 describes our Risk Optimization Algorithm. 
Its core is the implementation of the GETBESTQUERY function in Algorithm 2, which is explained in the following (with the name of the respective method given in brackets). Further details concerning Algorithm 3 are provided in Section 4.</p><p>1. (GETMINSCOREQUERY) Determine the best query X i ∈ X according to sc ent .</p><p>That is:</p><formula xml:id="formula_15">X i = arg min X k ∈X (sc ent (X k )).</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">(GETQUERYCAUTIOUSNESS) If</head><p>X i is a non-high-risk query, i.e. caut(X i ) ≥ c, select X i . In this case X i is the query with maximum information gain among all queries X and additionally guarantees the required elimination rate specified by c. 3. (GETALTERNATIVEQUERY) Otherwise, select the query X j ∈ X (X j ≠ X i ) which has minimal score sc ent among all least cautious non-high-risk queries L. That means: X j = arg min</p><formula xml:id="formula_16">X k ∈L (sc ent (X k )).</formula><p>where</p><formula xml:id="formula_17">L = {X r ∈ X \ HR(X) | ∀X t ∈ X \ HR(X) : caut(X r ) ≤ caut(X t )}.</formula><p>If there is no such query X j ∈ X, then select X i from step 1. 4. Given the query outcome X s = a s , update the meta information as per steps 4a and 4b. If step 2 was successful, then X s = X i , otherwise X s = X j . (a) (UPDATECAUTIOUSNESS) Update the cautiousness c depending on the elimination rate e(X s , a s ) as follows:</p><formula xml:id="formula_18">c ← c + c adj<label>(5)</label></formula><p>where c adj denotes the cautiousness adjustment factor which is defined as follows:</p><formula xml:id="formula_19">c adj := 2 (c̄ − c̲) adj where adj := ⌊|D|/2 − ε⌋ / |D| − e(X s , a s )<label>(6)</label></formula><p>and ε ∈ (0, 1/2) is an arbitrary real number. The factor 2 (c̄ − c̲) in formula ( <ref type="formula" target="#formula_19">6</ref>) simply regulates the extent of the cautiousness adjustment depending on the interval length c̄ − c̲. The more crucial factor in the formula is adj, which indicates whether c adj is positive or negative, i.e. whether the cautiousness is increased or decreased. Note that adj &lt; 0 holds for any favorable query result, independently of the cautiousness caut(X s ) of the asked query X s . 
This has the following reasons: For even |D|, the elimination rate resulting from the favorable query outcome is ≥ (|D|/2) / |D|, which is the maximal value the minimal elimination rate of a query can take (see Definition 1), and an analogous argument applies for odd |D|. If the cautiousness resulting from formula ( <ref type="formula" target="#formula_18">5</ref>) is outside the user-defined cautiousness interval [c̲, c̄], it is set to c̲ if c &lt; c̲ and to c̄ if c &gt; c̄. Positive c adj is a penalty telling the algorithm to get more cautious, whereas negative c adj is a bonus resulting in a braver behavior of the algorithm. (b) (GETPROBABILITIES) Update the diagnosis probabilities p(D r ) for D r ∈ D according to formula (4) on page 6. Section 4 gives more details on this method.</p><p>To illustrate the behavior of our algorithm, we revisit the example ontology O ex with six diagnoses discussed in Section 2. Again, we first assume that D * = D 2 . Further on, let the user be quite unsure whether or not to trust the fault probabilities (see page 6) and thus define the cautiousness c := 0.3 and the cautiousness interval [c̲, c̄] := [0, ⌊6/2⌋/6] = [0, 0.5]. Then, our algorithm will first test whether the query X 1 , which has minimal score sc ent , is admissible w.r.t. c. Therefore, caut(X 1 ) = 1/6 = 0.17 is calculated and compared with c = 0.3. Since caut(X 1 ) &lt; c, X 1 is denied. As a consequence, the algorithm seeks the query with minimal sc ent among all least cautious non-high-risk queries, i.e. those which guarantee the elimination of at least ⌈0.3 · 6⌉ = 2 diagnoses. The only least cautious non-high-risk query is X 2 , so it is chosen as the first query and answered by false. Therefore, the set of possible diagnoses is reduced to D = {D 1 , D 2 }. Then, the only query allowing to discriminate between D 1 and D 2 , namely X 1 , is selected and D * = D 2 is identified. For D * = D 5 , the algorithm acts as follows: The first query is again X 2 , which is answered by true in this case. 
Hence, the two diagnoses D 1 and D 2 are dismissed, yielding an elimination rate e(X 2 , true) = 2/6. Applying formula (6) then yields c adj = 0, i.e. c remains unchanged. Thus the next query should eliminate at least ⌈0.3 · (6 − 2)⌉ = 2 diagnoses. The only candidate, X 4 (answered by true), is selected, leaving open only one possible last query, X 5 . So, in both situations, D * = D 5 (three queries) and D * = D 2 (two queries), our method performed as well as the better of sc ent and sc split , respectively.</p></div>
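The risk-control quantities of Section 3 can be sketched in a few lines. This is our illustration, not the authors' implementation; in particular it assumes the adjustment term has the form adj = ⌊|D|/2 − ε⌋/|D| − e(X, a) with ε ∈ (0, 1/2), and the concrete value eps = 0.25 is our illustrative choice:

```python
import math

# Sketch of Definitions 1-3 and the cautiousness update of formulas (5)-(6).
# Assumption: adj = floor(|D|/2 - eps)/|D| - e(X, a), eps in (0, 1/2);
# eps = 0.25 is an illustrative choice, not fixed by the paper.

def caut(n_dx, n_dnx, n_total):
    """Definition 1: cautiousness = worst-case elimination rate of the query."""
    return min(n_dx, n_dnx) / n_total

def elimination_rate(n_dx, n_dnx, n_total, answer):
    """Definition 2: fraction of diagnoses eliminated by the given answer."""
    return (n_dnx if answer else n_dx) / n_total

def is_high_risk(n_dx, n_dnx, n_total, c):
    """Definition 3: high-risk iff the query's cautiousness is below c."""
    return c > caut(n_dx, n_dnx, n_total)

def update_cautiousness(c, c_min, c_max, n_total, elim_rate, eps=0.25):
    """Formulas (5)-(6): favorable outcomes make the algorithm braver
    (adj negative), unfavorable ones more cautious (adj positive)."""
    adj = (math.floor(n_total / 2.0 - eps) / n_total) - elim_rate
    c_new = c + adj * 2.0 * (c_max - c_min)
    return min(max(c_new, c_min), c_max)   # clip to the interval [c_min, c_max]

# Replaying the run above: |D| = 6, X2 answered true eliminates 2 diagnoses,
# so the elimination rate is 2/6 and the cautiousness stays at 0.3.
print(update_cautiousness(0.3, 0.0, 0.5, 6, 2 / 6))  # 0.3
```

With these helpers, GETQUERYCAUTIOUSNESS reduces to the is_high_risk test and UPDATECAUTIOUSNESS to one call of update_cautiousness per answered query.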
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Implementation Details</head><p>Our Risk Optimization Algorithm (Algorithm 3) takes the following inputs: (1) a diagnosis problem instance O, B, TC + , TC − , (2) meta information Info=(P, C), comprising on the one hand fault probabilities P = {p t |t ∈ CT }, where CT is the set of syntactical construct types available in the used ontology expression language, and on the other hand user-defined cautiousness parameters C = (c, c, c), (3) the number n of considered leading diagnoses per iteration, and (4) a stopping criterion stop_cond. Input (3) is needed because, in general, it is impossible to consider all diagnoses for a given diagnosis problem instance at once. A first simplification is to consider only minimal diagnoses, since the set of all diagnoses is sufficiently described by the set of all minimal diagnoses. This holds because any superset of a diagnosis is also a diagnosis, due to the fact that eliminating an axiom from a consistent ontology cannot make the ontology inconsistent. However, this does not solve the problem, as there might still be a huge number of minimal diagnoses. The difficulty with a large number of diagnoses lies in the time complexity of computing them. In order to explain this problem in more detail, we first provide a short review of the diagnosis computation method described in <ref type="bibr" target="#b0">[1]</ref> which we employ to calculate diagnoses. This method exploits the notion of a conflict set, which is defined as follows:</p><formula xml:id="formula_20">Definition 4 (Conflict Set). Given a diagnosis problem O, B, TC + , TC − , a conflict set CS ⊆ O is a set of axioms s.t. there is a t − ∈ TC − for which CS ∪ B ∪ TC + ∪ t − is inconsistent. A conflict set CS is called minimal iff there is no CS ′ ⊂ CS such that CS ′ is a conflict set.</formula><p>Simply put, a conflict set CS is an inconsistent part of the ontology, i.e. 
CS does not meet requirements 1 or 3 in Problem Definition 1. If a diagnosis problem instance does not include any conflict set, then this problem instance is either solved, i.e. meets requirements 1, 2 and 3 in Problem Definition 1, or can be easily solved by adding the desired positive test cases TC + if requirement 2 is not met. So, the idea is to resolve every conflict set in order to solve a given diagnosis problem instance. This is achieved by deleting or changing at least one axiom of a minimal conflict set. The following proposition <ref type="bibr" target="#b6">[7]</ref> summarizes these thoughts:</p><formula xml:id="formula_21">Proposition 1. D ⊆ O is a (minimal) diagnosis iff D is a (minimal) hitting set of all minimal conflict sets.</formula><p>In <ref type="bibr" target="#b0">[1]</ref>, the QUICKXPLAIN algorithm (in short QX ) introduced by <ref type="bibr" target="#b1">[2]</ref> is used to compute minimal conflict sets. QX (B, A) takes as inputs a set of trusted background knowledge axioms B and a set of axioms A and returns a minimal conflict set CS ⊆ A w.r.t.</p><formula xml:id="formula_22">B. If A ∪ B is consistent, QX outputs consistent. If B is inconsistent, the output is ∅.</formula><p>In order to compute minimal diagnoses, a tree called HS-Tree is built with minimal conflict sets as nodes. The path from the root node nd 0 to a node nd k i on level k is denoted by HS (nd k i ). The root of this tree is a minimal conflict set obtained by calling</p><formula xml:id="formula_23">QX (B ∪ TC + ∪ t − , O) for t − ∈ TC − until it first outputs CS(nd 0 ) = consistent. For each axiom ax i ∈ CS(nd k−1 j ) there is a path HS (nd k i ) = HS (nd k−1 j ) ∪ {ax i } to a node nd k i .</formula><p>In order to compute a minimal conflict set for node</p><formula xml:id="formula_24">nd k i , QX (B ∪ TC + ∪ t − , O \ HS (nd k i )) is run for t − ∈ TC − until an output CS(nd k i ) = consistent is returned. If all outputs are consistent, i.e. 
(O \ HS (nd k i )) ∪ B ∪ TC + ∪ t − is</formula><p>consistent for all t − ∈ TC − , then a minimal hitting set D = HS (nd k i ) of all minimal conflict sets, i.e. a minimal diagnosis, is found and node nd k i is not further expanded. The calculation of all minimal diagnoses at once would thus involve the complete construction of the HS-Tree. However, in the worst case this requires a number of calls to QX (B, A) that is exponential in n CS , where n CS denotes the total number of minimal conflict sets; each call to QX itself requires O(log 2 |A|) calls to an underlying reasoner. Also, generating all queries (w.r.t. certain entailment types) for a number k of diagnoses quickly becomes impractical as k grows, since all 2 k − 2 non-trivial partitions are explored (see Algorithm 1). Therefore it is generally not feasible to examine all minimal diagnoses at the same time.</p><p>In our approach, we tackle this issue by introducing a leading diagnoses window of fixed size n, i.e. |D| = n. That is, when the debugging session starts, the first n minimal diagnoses are computed as explained before. However, instead of calculating the n minimal diagnoses by ascending cardinality as in <ref type="bibr" target="#b8">[9]</ref>, our method implements a uniform-cost search on the HS-Tree to find the n most probable minimal diagnoses. This uniform-cost search is guided by the meta information, i.e. the fault probabilities P = {p t |t ∈ CT }, provided as an input to our algorithm. From these fault probabilities, the probabilities p(ax j ) of axioms ax j ∈ O being erroneous can be determined according to formula (1). These probabilities are then used to decide which node to expand next in the search. Namely, the node nd i ∈ V whose path HS (nd i ) from the root node maximizes p(HS (nd i )) = ∏ ax j ∈HS (nd i ) p(ax j ) among all generated nodes is expanded next, where V is the set of all generated nodes in the HS-Tree. 
In this manner, the most probable diagnoses are generated first by the GETDIAGNOSES function. If g ≤ n leading diagnoses are eliminated by a query, then, in the next iteration, the function supplies the next g most probable diagnoses, and so on. Figuratively speaking, the leading diagnoses window moves on by e(X i , a i ) |D| steps after each query X i to include the next most probable e(X i , a i ) |D| diagnoses.</p><p>The GETPROBABILITIES function in each iteration calculates the diagnosis probabilities for the current set of leading diagnoses D, taking into account all information gathered by querying the oracle so far. To that end, it calculates the adjusted diagnosis probabilities p adj (D i ) for D i ∈ D as p adj (D i ) = (1/2) z p(D i ), where z is the number of preceding queries X k with D i ∈ D ∅ k and p(D i ) is the probability of D i calculated directly from the initially given fault probabilities P by formulas (1) and (2). Afterwards, the probabilities p adj (D i ) are normalized. Note that z can be computed from TC + and TC − , which comprise all query answers. This way of updating probabilities complies exactly with Bayes' theorem as given by formula (4). The remaining methods used in Algorithm 3 are explained in Section 3.</p></div>
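The GETPROBABILITIES update just described, p_adj(D_i) = (1/2)^z p(D_i) followed by normalization, can be sketched as follows. The function name and the list-based interface are assumptions for illustration.

```python
def adjusted_probabilities(priors, z_counts):
    """priors: p(D_i) from formulas (1) and (2); z_counts: for each D_i the
    number of preceding queries X_k for which D_i made no prediction
    (D_i in D^0_k). Returns the normalized adjusted probabilities p_adj."""
    p_adj = [p * 0.5 ** z for p, z in zip(priors, z_counts)]
    total = sum(p_adj)
    return [p / total for p in p_adj]

# Two equally likely diagnoses; the first made no prediction for one query,
# so its probability is halved relative to the second before normalization:
print(adjusted_probabilities([0.5, 0.5], [1, 0]))  # roughly [0.333, 0.667]
```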
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Evaluation</head><p>We tested our Risk Optimization Algorithm (R) against the entropy-based approach (E ) and the split-in-half approach (S ) which were explained in Section 2. As adaptiveness is one of the central achievements of our approach, the main goal of this evaluation is to show its robustness in a variety of possible application scenarios. A major obstacle, however, is that only a small number of inconsistent/incoherent ontologies are available and they capture only a few situations, like ontology learning or tutorial cases. So, we decided to simulate diagnosis sessions covering a broader variety of situations using a model M which takes as inputs a set of total diagnoses, fault probabilities, and the parameters explained throughout the rest of this section. M is realistic because the values of parameters used in M (e.g. for query generation and diagnosis elimination) were determined by taking averages (*) after analyzing the structure and dependencies between diagnoses as well as between diagnoses and queries using real-world inconsistent/incoherent ontologies as in <ref type="bibr" target="#b8">[9]</ref>. M is fair since each approach A ∈ {E , S , R} was tested under exactly the same conditions (i.e. diagnoses, queries, probabilities, etc.) by means of random seeds. And M enables exhaustive automated tests.</p><p>We randomly instantiated M in each test iteration featuring different types of fault probability distributions PT ∈ {extreme, moderate, uniform} as well as different cases in terms of quality of probabilities QU ∈ {good , average, bad }. Each approach A was tested for the nine different settings in P T × QU . 
To construct different types of fault probability distributions P T , it is sufficient to change the magnitude of the random ratio r ij = p(D i )/p(D j ) between diagnosis probabilities p(D i ) &gt; p(D j ) under the assumption that D i is a neighbor of D j when the diagnoses are ordered descending by p(.). This is because (i) our approach uses the fault probabilities p t only at the very beginning to calculate the diagnosis probabilities p(D i ) and subsequently only works with the probabilities p(D i ), and because (ii) our implementation of the diagnosis generation supplies diagnoses in order w.r.t. their probability (see Section 4). So, extreme, moderate and uniform refer to a 'high', 'medium' and equal-to-one r ij , respectively. Concretely, an extreme distribution was instantiated by generating uniformly distributed random values r i ∈ [0,1), one for each diagnosis D i , and weighting each r i by w i = (1/1.5) i . That is, p(D i ) = r i w i . As a second step, the probabilities p(D i ) were normalized and the diagnoses ordered descending by p(.). A moderate distribution was generated analogously, but with different weights w i = (1/1.1) i . To obtain a uniform distribution we simply assigned equal probabilities to all diagnoses. The quality QU of fault probabilities is measured in terms of the position of the target diagnosis D * in the list of all diagnoses ordered by descending probability. By good, average and bad quality we mean that D * is located randomly within the first, fourth and ninth tenth of this ordered list.</p><p>Our test setting, implemented in Java 2 , was the following. For each A and each precondition in P T × QU we performed 35 iterations in each of which we instantiated M with 200 diagnoses in total. As explained before, we then assigned probabilities to these diagnoses according to P T and we marked a diagnosis as the target D * depending on QU . 
Furthermore, we set the number of leading diagnoses to 9, which constitutes a trade-off between computational complexity (e.g. query generation) and the amount of information taken into account for query selection. The stop condition stop_cond was defined as follows: If the currently most probable diagnosis D 1 has a probability at least three times as high as the second most probable diagnosis D 2 , i.e. p(D 1 ) ≥ 3 p(D 2 ), the user is asked to check whether D 1 corresponds to the target diagnosis D * . If so, D * is found and the debugging session ends. Otherwise, the user continues the debugging session. In our implementation this 'user check' was simply done by testing if D 1 == D * . Queries were then generated in conformance with Algorithm 1, where D is the current set of leading diagnoses. In order to simulate the non-existence of certain queries according to (*), we ran through all subsets D X i ⊆ D and selected a D X i with probability 0.4, and for a selected D X i we ran through all partitions D ¬X i , D ∅ i of D \ D X i and selected D ¬X i , D ∅ i with probability 0.1 to constitute a query D X i , D ¬X i , D ∅ i which was then added to X. The answer to a query X i is well-defined if D * ∈ D X i (true) or D * ∈ D ¬X i (false). Otherwise, we simulated the feedback of the oracle by generating another diagnosis probability distribution p real , which can be seen as the (unknown) 'real' fault distribution applying for the user. For those queries X i where D * ∉ D X i ∪ D ¬X i , the oracle was implemented to give answers according to p real . The type of the distribution p real was determined by QU . Hence, for the good case, p real was generated in a way that it correlates positively with p, i.e. p real (D i ) = r i p(D i ) where r i ∈ [0, 1) is a uniformly distributed random number. For the average case, p real was generated independently of p, i.e. p real (D i ) = r i . For the bad case, p real was generated s.t. 
it correlates negatively with p, i.e. p real (D i ) = r i /p(D i ). Then these probabilities were normalized.</p><p>The relation between axioms in terms of common entailments was realized as follows: For each query X ∈ X, we specified according to (*) that 95% of diagnoses currently not in the set of leading diagnoses D make no prediction about the query outcome, i.e. are in no case eliminated by this query. All other diagnoses D i ∉ D were allocated randomly to predict either true or false.</p><p>After each iteration we measured the number of queries Q# needed to find the target diagnosis D * , the average elimination rate ER(%) observed for these Q# queries, and the number of queries LD# still needed to find D * after it has become a leading diagnosis, i.e. D * ∈ D. Figure <ref type="figure" target="#fig_0">1</ref> illustrates the overall test results. In the bar charts, the figure shows ER(%), Q# and LD# averaged over all 35 iterations for each approach A ∈ {E , S , R} and each precondition in P T × QU . The box plots, on the other hand, indicate the variance of the observed number of queries Q# during all 35 iterations by providing the minimal and maximal observed values, the first and third quartiles and the median. The area between the 0.25 and 0.75 quantiles is drawn as a box; the horizontal line inside the box marks the median. We tested our method R with many different settings for the cautiousness parameters. Performance was similar for all these settings thanks to the learning algorithm. Evaluation results are given for (c, c, c) = (0.4, 0.3, 0.5), which yielded the most stable results. In Figure <ref type="figure" target="#fig_0">1</ref> the superior average performance of R in comparison to both other approaches S and E is clearly evident. It can be observed that R needs the fewest queries in each situation. This is due to the high ER(%) and the low LD# achieved. 
Indeed, our method maximizes ER(%) while minimizing LD# in all cases compared to S and E . As a consequence of the high ER(%), the leading diagnoses window in our method moves on quickly to consider less probable diagnoses for discrimination. In this way, the target diagnosis gets 'covered' earlier by the leading diagnoses window and is thus identified with fewer queries. This high ER(%) can be explained by our method's constant monitoring of the elimination rate of selected queries and its ability to react quickly to poor performance by switching to a more cautious strategy. Accountable for the minimization of LD# is the learning of probabilities adopted from the entropy-based approach, coupled with the high ER(%) achieved by our method. The improvement of R w.r.t. Q# is most clearly visible in the box plots, which show that the interval of the most frequent 50% of observed Q# values (denoted by the box) is always located at least as low as for the better of S and E . Where the lower bound of the interval is as low as for another method, the length of the interval, i.e. the variance, is smaller for our method R.</p></div>
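The extreme and moderate fault probability distributions described earlier in this section can be instantiated as sketched below: uniform random values weighted by (1/base)^i with base 1.5 (extreme) or 1.1 (moderate), then normalized and sorted descending. The fixed seed stands in for the evaluation's random seeding; this is a sketch, not the actual Java test harness.

```python
import random

def make_distribution(n, base, seed=0):
    """Uniform random values r_i in [0, 1) weighted by w_i = (1/base)^i,
    then normalized and ordered descending, as described in the text."""
    rng = random.Random(seed)
    raw = [rng.random() * (1.0 / base) ** i for i in range(1, n + 1)]
    total = sum(raw)
    return sorted((x / total for x in raw), reverse=True)

extreme = make_distribution(200, 1.5)   # 'high' neighbor ratios r_ij
moderate = make_distribution(200, 1.1)  # 'medium' neighbor ratios
uniform = [1 / 200] * 200               # equal-to-one ratios
# Each result is a probability distribution over the 200 diagnoses,
# with the probability mass concentrated on the first entries.
print(len(extreme), len(moderate), len(uniform))
```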
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In this paper we presented a risk optimization approach to sequential debugging of ontologies. The proposed method exploits meta information in terms of fault probabilities and user-defined willingness to take risk. A substantial cost reduction compared to existing methods was shown for any quality of the given meta information. The proposed method is of particular significance for domains where only vague meta information is available, such as ontology development. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>ax 1 :</head><label>1</label><figDesc>A ∀s.B ax 2 : B ¬∃v.¬C ax 3 : C D ¬D 1 ax 4 : D E E 1 ax 5 : E F ax 6 : F R and the assertions A : {A(w), s(w, w), v(w, w), ¬R(w)}. Let us assume that all assertions are added to the background theory, i.e. B = A, and both sets TC + and TC − are empty. Then the ontology O ex = T is inconsistent and the set of minimal diagnoses for this diagnosis problem instance T , A, ∅, ∅ is D = {D 1 : [ax 1 ], D 2 : [ax 2 ], D 3 : [ax 3 ], D 4 : [ax 4 ], D 5 : [ax 5 ], D 6 : [ax 6 ]}.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>) leads to diagnoses fault probabilities p(D 1 ) = 0.634, p(D 2 ) = 0.2, p(D 3 ) = 0.084, p(D 4 ) = 0.04, p(D 5 ) = 0.02, p(D 6 ) = 0.02. Using entropy-based selection, X 1 , i.e. (O *</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>[c, c] where c, c, c ∈ [0, ⌊|D|/2⌋/|D|] and c ≤ c ≤ c. The interval [c, c] constitutes the set of all admissible cautiousness values the algorithm may take during the debugging session.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Algorithm 3: Risk Optimization Algorithm</head><label>3</label><figDesc>Input: diagnosis problem instance O, B, TC + , TC − , meta information Info=(P, C) where P = {pt|t ∈ CT } and C = (c, c, c), size of leading diagnoses window n, stop condition stop_cond
Output: a diagnosis
1 TC + ← ∅; TC − ← ∅; D ← ∅;
2 repeat
3 D ← getDiagnoses(O, B, TC + , TC − , n, D);
4 P ← getProbabilities(P, D, TC + , TC − );
5 X ← generateQueries(O, B, TC + , D);
6 Xi ← getMinScoreQuery(P, X, scent);
7 if getQueryCautiousness(Xi, D) &lt; c then Xi ← getAlternativeQuery(c, X, P, D);
8 if getAnswer(Xi) == true then TC + ← TC + ∪ {Xi};
9 else TC − ← TC − ∪ {Xi};
10 c ← updateCautiousness(D, TC + , TC − , Xi, c, c, c);
11 until (stop_cond ∨ getScore(Xi, scent) == 1);
12 return mostProbableDiag(D, P );</figDesc></figure>
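The control flow of Algorithm 3 can be illustrated with a runnable toy sketch. Here diagnoses are opaque ids, queries are fixed partitions shaped like those in Table 1, the oracle simply knows the target, and the cautiousness machinery is reduced to the admissibility test of Section 3. All of this is a simplification for illustration, not the actual implementation, which computes diagnoses and queries from the ontology via a reasoner.

```python
def debug_session(diagnoses, queries, target, c=0.3):
    """Iteratively ask queries until a single diagnosis remains.
    Each query is a pair (d_x, d_not_x) of diagnosis sets predicting
    the answers true and false, respectively."""
    remaining = set(diagnoses)
    asked = 0
    while len(remaining) > 1:
        # Keep only queries that still discriminate the remaining diagnoses.
        candidates = [(dx.intersection(remaining), dnx.intersection(remaining))
                      for dx, dnx in queries]
        candidates = [q for q in candidates if q[0] and q[1]]
        # Prefer admissible (non-high-risk) queries w.r.t. cautiousness c.
        admissible = [q for q in candidates
                      if min(len(q[0]), len(q[1])) / len(remaining) >= c]
        dx, dnx = (admissible or candidates)[0]
        asked += 1
        # Oracle answer: eliminate the side that contradicts the target.
        remaining -= dnx if target in dx else dx
    return remaining.pop(), asked

# Queries shaped like Table 1 (set contents chosen for illustration):
queries = [({"D2", "D3", "D4", "D5", "D6"}, {"D1"}),
           ({"D3", "D4", "D5", "D6"}, {"D1", "D2"}),
           ({"D4", "D5", "D6"}, {"D1", "D2", "D3"}),
           ({"D5", "D6"}, {"D1", "D2", "D3", "D4"}),
           ({"D6"}, {"D1", "D2", "D3", "D4", "D5"})]
print(debug_session(["D%d" % i for i in range(1, 7)], queries, "D2"))
# As in Section 3, target D2 is identified with two queries: ('D2', 2)
```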
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. For each combination of probability distribution type in {Extreme, Moderate, Uniform} and quality in {Good, Average, Bad} one bar chart and one box plot is shown. Every plot depicts values for approaches S (split-in-half), E (entropy) and R (new method). Bar charts show average required queries (Q#), elimination rate (ER(%)) and queries needed after D * ∈ D (LD#). Box plots show the variance of Q#.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>the target ontology and EX is called extension. A diagnosis D is minimal iff there is no D ′ ⊂ D s.t. D ′ is a diagnosis. A diagnosis D gives complete information about the correctness of each axiom ax k ∈ O, i.e. all ax i ∈ D are assumed to be faulty and all ax j ∈ O \ D are assumed to be correct.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 .</head><label>1</label><figDesc>Possible queries for diagnoses Di ∈ D</figDesc><table><row><cell></cell><cell cols="2">D3, D4, D5, D6 D1</cell><cell>∅</cell></row><row><cell></cell><cell>X2 : {C(w)} D3, D4, D5, D6</cell><cell>D1, D2</cell><cell>∅</cell></row><row><cell></cell><cell>X3 : {D(w)} D4, D5, D6</cell><cell>D1, D2, D3</cell><cell>∅</cell></row><row><cell></cell><cell>X4 : {E(w)} D5, D6</cell><cell>D1, D2, D3, D4</cell><cell>∅</cell></row><row><cell></cell><cell>X5 : {F (w)} D6</cell><cell cols="2">D1, D2, D3, D4, D5 ∅</cell></row><row><cell cols="2">Algorithm 2: Generic Diagnosis Discrimination</cell><cell></cell></row><row><cell></cell><cell cols="3">Input: diagnosis problem instance O, B, TC + , TC − , set of diagnoses D, meta information Info</cell></row><row><cell></cell><cell>Output: target diagnosis D  *</cell><cell></cell></row><row><cell cols="2">1 repeat</cell><cell></cell></row><row><cell>2</cell><cell>X ← getBestQuery (D, Info) ;</cell><cell></cell></row></table></figure>
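Proposition 1 above characterizes (minimal) diagnoses as (minimal) hitting sets of all minimal conflict sets. For very small inputs this correspondence can be checked by brute force, as sketched below; the HS-Tree of Section 4 is the practical method, and the axiom names here are made up.

```python
from itertools import combinations

def minimal_hitting_sets(conflict_sets, universe):
    """Enumerate all subsets of the universe that hit every conflict set,
    then keep only the subset-minimal ones."""
    elems = sorted(universe)
    hitting = [frozenset(c)
               for r in range(len(elems) + 1)
               for c in combinations(elems, r)
               if all(frozenset(c).intersection(cs) for cs in conflict_sets)]
    minimal = [h for h in hitting
               if not any(g != h and g.issubset(h) for g in hitting)]
    return sorted(sorted(h) for h in minimal)

# Two overlapping minimal conflict sets: every diagnosis must remove ax2,
# or both ax1 and ax3.
css = [{"ax1", "ax2"}, {"ax2", "ax3"}]
print(minimal_hitting_sets(css, {"ax1", "ax2", "ax3"}))
# -> [['ax1', 'ax3'], ['ax2']]
```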
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Note that throughout the rest of this paper we consider debugging of inconsistent ontologies. However, the same approach can also be used to restore coherency.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">See http://rmbd.googlecode.com</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2">Patrick Rodler, Kostyantyn Shchekotykhin, Philipp Fleiss and Gerhard Friedrich</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A General Diagnosis Method for Ontologies</title>
		<author>
			<persName><forename type="first">G</forename><surname>Friedrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shchekotykhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4 th International Semantic Web Conference (ISWC-05)</title>
				<editor>
			<persName><forename type="first">Y</forename><surname>Gil</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Motta</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Benjamins</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Musen</surname></persName>
		</editor>
		<meeting>the 4 th International Semantic Web Conference (ISWC-05)</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="232" to="246" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">QUICKXPLAIN: Preferred Explanations and Relaxations for Over-Constrained Problems</title>
		<author>
			<persName><forename type="first">U</forename><surname>Junker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th National Conference on Artificial Intelligence (AAAI-04)</title>
				<meeting><address><addrLine>San Jose, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="167" to="172" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Finding all Justifications of OWL DL Entailments</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kalyanpur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Parsia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Horridge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sirin</surname></persName>
		</author>
		<ptr target="http://iswc2007.semanticweb.org/papers/267.pdf" />
	</analytic>
	<monogr>
		<title level="m">Proc. of ISWC/ASWC2007</title>
				<meeting>of ISWC/ASWC2007<address><addrLine>Busan, South Korea; Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Verlag</publisher>
			<date type="published" when="2007-11">Nov 2007</date>
			<biblScope unit="volume">4825</biblScope>
			<biblScope unit="page" from="267" to="280" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Repairing Unsatisfiable Concepts in OWL Ontologies</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kalyanpur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Parsia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sirin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Cuenca-Grau</surname></persName>
		</author>
		<idno type="DOI">10.1007/11762256</idno>
		<ptr target="http://www.springerlink.com/index/10.1007/11762256" />
	</analytic>
	<monogr>
		<title level="m">3rd European Semantic Web Conference, ESWC 2006</title>
		<title level="s">Lecture Notes in Computer Science</title>
		<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="volume">4011</biblScope>
			<biblScope unit="page" from="170" to="184" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Diagnosing multiple faults</title>
		<author>
			<persName><forename type="first">J</forename><surname>De Kleer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">C</forename><surname>Williams</surname></persName>
		</author>
		<ptr target="http://linkinghub.elsevier.com/retrieve/pii/0004370287900634" />
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="97" to="130" />
			<date type="published" when="1987-04">Apr 1987</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">OWL Pizzas: Practical Experience of Teaching OWL-DL: Common Errors &amp; Common Patterns</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rector</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Drummond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Horridge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Rogers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Knublauch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Stevens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wroe</surname></persName>
		</author>
		<ptr target="http://www.springerlink.com/index/PNRG2T506C2JV3MD.pdf" />
	</analytic>
	<monogr>
		<title level="m">Engineering Knowledge in the Age of the SemanticWeb 14th International Conference, EKAW 2004</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Motta</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><forename type="middle">R</forename><surname>Shadbolt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Stutt</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Gibbins</surname></persName>
		</editor>
		<meeting><address><addrLine>Whittlebury Hall, UK</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="63" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A Theory of Diagnosis from First Principles</title>
		<author>
			<persName><forename type="first">R</forename><surname>Reiter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="57" to="95" />
			<date type="published" when="1987">1987</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Debugging Incoherent Terminologies</title>
		<author>
			<persName><forename type="first">S</forename><surname>Schlobach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cornet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>van Harmelen</surname></persName>
		</author>
		<idno type="DOI">10.1007/s10817-007-9076-z</idno>
		<ptr target="http://www.springerlink.com/index/10.1007/s10817-007-9076-z" />
	</analytic>
	<monogr>
		<title level="j">Journal of Automated Reasoning</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="317" to="349" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Query strategy for sequential ontology debugging</title>
		<author>
			<persName><forename type="first">K</forename><surname>Shchekotykhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Friedrich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of International Semantic Web Conference (ISWC 2010)</title>
				<editor>
			<persName><forename type="first">P</forename><forename type="middle">F</forename><surname>Patel-Schneider</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Pan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Hitzler</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Mika</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Zhang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">Z</forename><surname>Pan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Horrocks</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Glimm</surname></persName>
		</editor>
		<meeting>International Semantic Web Conference (ISWC 2010)<address><addrLine>Shanghai, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
