RIO: Minimizing User Interaction in Debugging of Aligned Ontologies

Patrick Rodler, Kostyantyn Shchekotykhin, Philipp Fleiss, and Gerhard Friedrich
Alpen-Adria Universität, Klagenfurt, 9020 Austria
firstname.lastname@aau.at

Abstract. Efficient ontology debugging is a cornerstone for many activities in the context of the Semantic Web, especially when automatic tools produce (parts of) ontologies, as in the field of ontology matching. The best currently known interactive debugging systems rely upon meta information in terms of fault probabilities, which can speed up the debugging procedure in the good case, but can also have a negative impact in the bad case. Unfortunately, assessment of meta information is only possible a posteriori. Hence, as long as the actual fault is unknown, there is always some risk of suboptimal interactive diagnosis discrimination. As an alternative, one might prefer to rely on a no-risk strategy. In this case, however, possibly well-chosen meta information cannot be exploited, again resulting in inefficient debugging actions. In this work we present a reinforcement learning strategy that continuously adapts its behavior depending on the performance achieved and minimizes the risk of using low-quality meta information. Therefore, this method is suitable for application scenarios where reliable a-priori fault estimates are difficult to obtain. Using faulty ontologies produced by ontology matchers, we show that the proposed strategy outperforms both active learning and no-risk approaches on average w.r.t. the required amount of user interaction.

1 Introduction

The foundation for widespread adoption of Semantic Web technologies is a broad community of ontology developers which is not restricted to experienced knowledge engineers. Instead, domain experts from diverse fields should be able to create ontologies incorporating their knowledge as autonomously as possible.
The resulting ontologies are required to fulfill some minimal quality criteria, usually consistency, coherency and absence of undesired entailments, in order to grant successful deployment. However, the correct formulation of logical descriptions in ontologies is an error-prone task, which accounts for a need for assistance in ontology development in terms of ontology debugging tools. Things get even worse when independent standalone ontologies describing related domains are unified into a single ontology (called aligned ontology) by adding a set of suitable correspondences (called alignment) between signature elements of the different ontologies. This task is addressed in the field of ontology matching, where researchers aim to produce automated tools for the generation of correspondences. Applying such tools, however, often results in inconsistent/incoherent aligned ontologies even though the input ontologies (considered separately) do not violate any quality criteria. Moreover, these aligned ontologies may exhibit a very complex fault structure as a consequence of the facts that (1) many links between the single ontologies are added at once and (2) the actual fault may be located in the produced alignment and/or in one or more of the single ontologies, e.g. if a correct correspondence between two concepts "activates" a fault in one of the single ontologies. In this vein, many different sources of inconsistency/incoherence could arise (due to (1)), each of which may comprise parts of each single ontology as well as of the alignment (due to (2)). In this work we present an interactive approach dealing with the general problem of locating a fault throughout the entire aligned ontology, not just in the alignment, as addressed by state-of-the-art systems in ontology matching such as CODI [10] or LogMap [5].

This research is funded by the Austrian Science Fund (Project V-Know, contract 19996).
Comparison of our method with these systems is inappropriate since they use greedy diagnosis techniques (e.g. [8]), whereas our approach is complete. Usually, ontology debugging tools [3, 4] use model-based diagnosis [11] to identify sets of faulty axioms, called diagnoses, that need to be modified or deleted in order to meet the imposed quality requirements. The major challenge inherent in the debugging task is often a substantial number of alternative diagnoses. This problem has been addressed in [12] by proposing an active learning debugging method which queries the user (e.g. a domain expert) for additional information about the intended ontology. In a debugging scenario involving a faulty ontology developed by one user, the meta information might be extracted from the logs of previous sessions, if available, or specified by the user based on their experience w.r.t. their own faults. However, in scenarios involving automated systems producing (parts of) ontologies, as in ontology matching, the choice of reasonable meta information is rather unclear. If, on the one hand, an active learning method is used relying on a guess of the meta information, this might result in an overhead w.r.t. user interaction of more than 2000%. If one wants to play it safe, on the other hand, by deciding not to exploit any meta information at all, this might also result in substantial extra time and effort for the user. So, in the light of the current state of the art, one is spoilt for choice between debugging strategies with high potential but also high risk, or methods with no risk but also no potential. In this work we present an ontology debugging approach with high potential and low risk, which allows user interaction throughout a debugging session to be minimized on average, without depending on high-quality meta information.
By virtue of its reinforcement learning capability, our approach is optimally suited for debugging aligned ontologies, where only vague or no meta information is available. On the one hand, our method takes advantage of the given meta information as long as good performance is achieved. On the other hand, it gradually becomes more independent of meta information if suboptimal behavior is measured. The method constantly improves the quality of the meta information and adapts a risk parameter based on the new information obtained by querying the user. This means that, in case of good meta information, the performance of our method will be close to the performance of the active learning method, whereas, in case of bad meta information, the achieved performance will approach the performance of the risk-free strategy. So, our approach can be seen as a risk optimization strategy (RIO) which combines the benefits of active learning and risk-free strategies. Experiments on two datasets of faulty ontologies produced by ontology matching systems show the feasibility, efficiency and scalability of RIO. The evaluation of these experiments will show that, on average, RIO is the best choice of strategy for both good and bad meta information, with savings in terms of user interaction of up to 80%. The problem specification, basic concepts and a motivating example are provided in Section 2. Section 3 explains the suggested approach and gives implementation details. Evaluation results are described in Section 4. Section 5 concludes.

2 Basic Concepts and Motivation

Ontology debugging deals with the following problem: Given is an ontology O which does not meet postulated requirements R.1 O is a set of axioms formulated in some monotonic knowledge representation language, e.g. OWL.
The task is to find one of generally many alternative subsets of axioms in O, called target diagnosis Dt ⊆ O, that needs to be altered or eliminated from the ontology such that the resulting ontology meets the given requirements and has the intended semantics. The general debugging setting we consider also envisions the opportunity for the user to specify some background knowledge B, i.e. a set of axioms which are known to be correct. Moreover, we allow the definition of a set P of positive (entailed) and a set N of negative (non-entailed) test cases, where each test case is a set of axioms. More formally, ontology debugging can be defined in terms of conditions a target ontology must fulfill, which leads to the definition of a diagnosis problem instance, for which we search for solutions, i.e. diagnoses:

Definition 1 (Target Ontology, Diagnosis Problem Instance). Let O = (T, A) denote an ontology consisting of a set of terminological axioms T and a set of assertional axioms A, P a set of positive test cases, N a set of negative test cases, B a set of background knowledge axioms, and R a set of requirements to an ontology. Then an ontology Ot is called target ontology iff all the following conditions are fulfilled:

∀ r ∈ R : Ot ∪ B fulfills r
∀ p ∈ P : Ot ∪ B ⊨ p
∀ n ∈ N : Ot ∪ B ⊭ n

The tuple ⟨O, B, P, N⟩R is called a diagnosis problem instance iff B ∪ (⋃p∈P p) ⊭ n for all n ∈ N and O is not a target ontology, i.e. O violates at least one of the conditions above.

Definition 2 (Diagnosis). We call D ⊆ O a diagnosis w.r.t. a diagnosis problem instance ⟨O, B, P, N⟩R iff there exists a set of axioms EXD such that (O \ D) ∪ EXD is a target ontology. A diagnosis D is minimal iff there is no D′ ⊂ D such that D′ is a diagnosis. A diagnosis D gives complete information about the correctness of each axiom axk ∈ O, i.e. all axi ∈ D are assumed to be faulty and all axj ∈ O \ D are assumed to be correct. The set of all minimal diagnoses is denoted by D.
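Definition 2 suggests a (naive) procedure for finding minimal diagnoses: test increasingly large candidate subsets against a consistency oracle and keep only the subset-minimal ones that restore the required properties. The following is a brute-force sketch, not the paper's algorithm; the conflict sets of the toy oracle are hypothetical, and EXD is approximated by the empty set.

```python
from itertools import combinations

def minimal_diagnoses(axioms, consistent):
    """Enumerate minimal diagnoses per Definition 2, approximating EX_D by
    the empty set: D is a diagnosis iff removing it restores consistency."""
    diagnoses = []
    for size in range(len(axioms) + 1):
        for cand in combinations(sorted(axioms), size):
            cand = frozenset(cand)
            if any(d <= cand for d in diagnoses):
                continue  # a known diagnosis is a subset, so cand is not minimal
            if consistent(axioms - cand):
                diagnoses.append(cand)
    return diagnoses

# Toy oracle: the ontology stays inconsistent while it still contains
# one of two (hypothetical) conflict sets among axioms 1..4.
conflicts = [{1, 2}, {2, 3}]
def consistent(remaining):
    return not any(c <= remaining for c in conflicts)

print(minimal_diagnoses(frozenset({1, 2, 3, 4}), consistent))
```

Here the minimal diagnoses are exactly the minimal hitting sets of the conflicts, {2} and {1, 3}; practical debuggers compute them far more efficiently, e.g. with the HS-Tree technique referenced later in the paper.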
The identification of an extension EXD, accomplished e.g. by some learning approach, is a crucial part of the ontology repair process. However, the formulation of a complete extension is outside the scope of this work, where we focus on computing diagnoses. Following the approach suggested in [12], we approximate EXD by the set ⋃p∈P p.

Example: Consider the OWL ontology O encompassing the following terminology T:

ax1: PhD ⊑ Researcher
ax2: Researcher ⊑ DeptEmployee
ax3: PhDStudent ⊑ Student
ax4: Student ⊑ ¬DeptMember
ax5: PhDStudent ⊑ PhD
ax6: DeptEmployee ⊑ DeptMember

and an assertional axiom A = {PhDStudent(s)}. Then O is inconsistent since it describes a PhD student as both a department member and not. Let us assume that the assertion PhDStudent(s) is considered as correct and is thus added to the background theory, i.e. B = A, and both sets P and N are empty. Then the set of minimal diagnoses for the given problem instance ⟨T, A, ∅, ∅⟩ is D = {D1: [ax1], D2: [ax2], D3: [ax3], D4: [ax4], D5: [ax5], D6: [ax6]}. D can be computed by a diagnosis algorithm such as the one presented in [3]. With six diagnoses for six ontology axioms, this example might already give an idea that in many cases the number of diagnoses in D can get very large. Without any prior knowledge, each of the diagnoses in D is equally likely to be the target diagnosis Dt, which is selected by a user in order to formulate the intended ontology Ot := (O \ Dt) ∪ EXDt. Identification of Dt can be accomplished by means of queries [12]. Thereby, the fact is exploited that the ontologies O \ Di and O \ Dj resulting from the application of different diagnoses Di, Dj ∈ D (Di ≠ Dj) entail different sets of logical axioms.

1 Throughout the paper we consider debugging of inconsistent and/or incoherent ontologies, i.e. whenever not stated explicitly we assume R = {consistency, coherency}.

Definition 3 (Query).
A set of logical axioms Xj is called a query iff there exists a set of diagnoses ∅ ⊂ D′ ⊂ D such that Xj is entailed by each ontology in {Oi* | Di ∈ D′}, where Oi* := (O \ Di) ∪ B ∪ (⋃p∈P p). Asking a query Xj to a user means asking them (Ot ⊨ Xj ?). The set of all queries w.r.t. D is denoted by XD.2

Each query Xj partitions the set of diagnoses D into ⟨D^P_j, D^N_j, D^∅_j⟩ such that D^P_j = {Di | Oi* ⊨ Xj}, D^N_j = {Di | Oi* ∪ Xj is inconsistent} and D^∅_j = D \ (D^P_j ∪ D^N_j). If the answering of queries by a user u is modeled as a function au : X → {yes, no}, then the following holds: If au(Xj) = yes, then Xj is added to the positive test cases, i.e. P ← P ∪ {Xj}, and all diagnoses in D^N_j are rejected. Given that au(Xj) = no, then N ← N ∪ {Xj} and all diagnoses in D^P_j are rejected. For the example ontology O, we could, e.g., ask the user the query X1 := (Ot ⊨ {DeptEmployee(s), Student(s)}?) with the associated partition ⟨D^P_1, D^N_1, D^∅_1⟩ = ⟨{D4, D6}, {D1, D2, D3, D5}, ∅⟩. A negative answer would then eliminate {D4, D6}.

Definition 4 (Diagnosis Discrimination). Given the set of diagnoses D = {D1, ..., Dn} w.r.t. ⟨O, B, P, N⟩R and a user u, find a sequence (X1, ..., Xq) of queries Xi ∈ X with minimal q, such that D = {Dt} after assigning each Xi (i = 1, ..., q) to either P iff au(Xi) = yes or N iff au(Xi) = no.3

A set of queries for a given set of diagnoses D can be generated as shown in Algorithm 1. In each iteration, for a set of diagnoses D^P ⊂ D, the generator gets a set of logical descriptions X that are entailed by each ontology Oi* where Di ∈ D^P (function getEntailments). When we speak of entailments, we always address the output computed by the classification and realization services of a reasoner [1, p. 323 ff.]. These axioms X are then used to classify the remaining diagnoses in D \ D^P in order to obtain the partition ⟨D^P, D^N, D^∅⟩ associated with X.
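For the running example, this classification can be reproduced with a small forward-chaining sketch; the propositional encoding of the DL axioms and the helper names are ours, standing in for an actual reasoner.

```python
# The six axioms of the example as implications over concept literals
# of the individual s ("~X" denotes the complement of X).
AXIOMS = {1: ("PhD", "Researcher"), 2: ("Researcher", "DeptEmployee"),
          3: ("PhDStudent", "Student"), 4: ("Student", "~DeptMember"),
          5: ("PhDStudent", "PhD"), 6: ("DeptEmployee", "DeptMember")}

def close(axioms, facts):
    """Forward-chain the implications until a fixpoint is reached."""
    derived = set(facts)
    while True:
        new = {r for (l, r) in axioms if l in derived} - derived
        if not new:
            return derived
        derived |= new

def inconsistent(derived):
    return any("~" + c in derived for c in derived)

query = {"DeptEmployee", "Student"}            # query X1 from the text
partition = {"P": [], "N": [], "0": []}
for i in AXIOMS:                               # diagnosis D_i = [ax_i]
    axioms = [AXIOMS[j] for j in AXIOMS if j != i]      # O_i* without D_i
    derived = close(axioms, {"PhDStudent"})             # B = {PhDStudent(s)}
    if query <= derived:                                # O_i* entails X1
        partition["P"].append(i)
    elif inconsistent(close(axioms, derived | query)):  # O_i* ∪ X1 inconsistent
        partition["N"].append(i)
    else:
        partition["0"].append(i)
print(partition)  # {'P': [4, 6], 'N': [1, 2, 3, 5], '0': []}
```

This reproduces the partition stated above: D4 and D6 end up in D^P_1, the other four diagnoses in D^N_1, and D^∅_1 is empty.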
Then, together with its partition, X is added to the set of queries X. Note that in real-world applications, investigation of all possible subsets of the set D might be infeasible. Thus, it is common to approximate the set of all minimal diagnoses by a set of leading diagnoses. This set comprises a predefined number n of minimal diagnoses.

2 For the sake of simplicity, we will use X instead of XD throughout this work because the D associated with X will be clear from the context.
3 Since the user u is assumed fixed throughout a debugging session and for brevity, we will use ai equivalent to au(Xi) in the rest of this work.

Algorithm 1: Query Generation
Input: diagnosis problem instance ⟨O, B, P, N⟩, set of diagnoses D
Output: a set of queries and associated partitions X
1 foreach D^P ⊂ D do
2   X ← getEntailments(O, B, P, D^P);
3   if X ≠ ∅ then
4     foreach Dr ∈ D \ D^P do
5       if Or* ⊨ X then D^P ← D^P ∪ {Dr};
6       else if Or* ∪ X is inconsistent then D^N ← D^N ∪ {Dr};
7       else D^∅ ← D^∅ ∪ {Dr};
8     X ← X ∪ {⟨X, D^P, D^N, D^∅⟩}
9 return X;

The query generation algorithm returns a set of queries X that generally contains a lot of elements. Therefore the authors in [12] suggested two query selection strategies.

Split-in-half strategy: selects the query Xj which minimizes the scoring function sc_split(Xj) = ||D^P_j| − |D^N_j|| + |D^∅_j|, i.e. this strategy prefers queries which eliminate half of the diagnoses independently of the query outcome.

Entropy-based strategy: uses information about prior probabilities for a user to make a fault in an axiom [12]. The fault probabilities of axioms p(axi) can in turn be used to determine fault probabilities of diagnoses Di ∈ D. The strategy is then to select the query which minimizes the expected entropy of the set of leading diagnoses D after the query is answered. This means that the expected uncertainty is minimized and the expected information gain is maximized.
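The split-in-half score introduced above is straightforward to compute from the sizes of a query's partition; a minimal sketch (the function signature is our own):

```python
def sc_split(n_dp, n_dn, n_d0):
    """Split-in-half score ||D^P| - |D^N|| + |D^0|: minimal (0) for queries
    that halve the leading diagnoses whichever way they are answered."""
    return abs(n_dp - n_dn) + n_d0

# Query X1 of the running example (|D^P|=2, |D^N|=4, |D^0|=0) vs. an
# ideal halving query over the same six diagnoses:
print(sc_split(2, 4, 0))  # 2
print(sc_split(3, 3, 0))  # 0
```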
According to [7], minimizing the expected entropy is equivalent to choosing the query Xj which minimizes the scoring function sc_ent(Xj) = Σ_{aj ∈ {yes, no}} p(aj) log₂ p(aj) + p(D^∅_j) + 1. After each query Xj, the diagnosis probabilities are updated according to the Bayesian formula [12].

A diagnosis discrimination procedure can use either of the strategies to identify the target diagnosis Dt. The result of the evaluation in [12] shows that entropy-based query selection reveals better performance than split-in-half in most of the cases. However, split-in-half proved to be the best strategy in situations when only vague priors are provided, i.e. the target diagnosis Dt has a rather low prior fault probability. Therefore the selection of prior fault probabilities is crucial for successful query selection and minimization of user interaction.

Example (continued): In our example, let the user specify p(axi) = 0.001 for i = 1, ..., 4, p(ax5) = 0.1 and p(ax6) = 0.15. Given Dt := D2, the no-risk strategy sc_split (three queries) is more suitable than sc_ent (four queries) because the fault probabilities disfavor D2. If Dt := D6, then the entropy-based strategy requires only two queries while split-in-half takes three queries due to favorable fault probabilities. We learn from this example that the best choice of discrimination strategy depends on the quality of the meta information in terms of prior fault probabilities. In cases where adequate meta information is not available and hard to estimate, e.g. in ontology matching, an inappropriate choice of strategy might cause tremendous extra effort for the user interacting with the debugging system.

3 Risk Optimization Strategy for Query Selection

The proposed Risk Optimization Algorithm (RIO) extends the entropy-based query selection strategy with a dynamic learning procedure that learns by reinforcement how to select optimal queries.
Moreover, it continually improves the prior fault probabilities based on new knowledge obtained through queries to a user. The behavior of our algorithm can be co-determined by the user: the algorithm takes into account the user's doubt about the priors in terms of the initial cautiousness c as well as the cautiousness interval [c̲, c̄], where c, c̲, c̄ ∈ [cmin, cmax] := [0, ⌊|D|/2⌋/|D|], c̲ ≤ c ≤ c̄, and D contains at most n leading diagnoses (see Section 2). The interval [c̲, c̄] constitutes the set of all admissible cautiousness values the algorithm may take during the debugging session. High trust in the prior fault probabilities is reflected by specifying a low minimum required cautiousness c̲ and/or a low maximum admissible cautiousness c̄. If the user is unsure about the rationality of the priors, this can be expressed by setting c̲ and/or c̄ to a higher value. Intuitively, c̲ − cmin and cmax − c̄ represent the minimal desired difference in performance to a high-risk (entropy) and no-risk (split-in-half) query selection, respectively. The relationship between cautiousness c and queries is formalized by the following definitions:

Definition 5 (Cautiousness of a Query). We define the cautiousness caut(Xi) of a query Xi as follows:

caut(Xi) := min(|D^P_i|, |D^N_i|) / |D|  ∈  [0, ⌊|D|/2⌋ / |D|]

A query Xi is called braver than query Xj iff caut(Xi) < caut(Xj). Otherwise Xi is called more cautious than Xj. A query with the highest possible cautiousness is called a no-risk query.

Definition 6 (Elimination Rate). Given a query Xi and the corresponding answer ai ∈ {yes, no}, the elimination rate e(Xi, ai) is defined as follows:

e(Xi, ai) = |D^N_i| / |D|  if ai = yes
e(Xi, ai) = |D^P_i| / |D|  if ai = no

The answer ai to a query Xi is called favorable iff it maximizes the elimination rate e(Xi, ai). Otherwise ai is called unfavorable. So, the cautiousness caut(Xi) of a query Xi is exactly the minimal elimination rate, i.e.
caut(Xi) = e(Xi, ai) given that ai is the unfavorable query result. Intuitively, the user-defined cautiousness c is the minimum proportion of diagnoses in D which should be eliminated by the successive query. For braver queries the interval between minimum and maximum elimination rate is larger than for more cautious queries; for no-risk queries it is minimal.

Definition 7 (High-Risk Query). Given a query Xi and cautiousness c, Xi is called a high-risk query iff caut(Xi) < c, i.e. the cautiousness of the query is lower than the algorithm's current cautiousness value c. Otherwise, Xi is called a non-high-risk query. By HR_c(X) ⊆ X we denote the set of all high-risk queries w.r.t. c.

For a given cautiousness c, the set of all queries X can be partitioned into high-risk queries and non-high-risk queries. Given a user's answer as to a query Xs, the cautiousness c is updated depending on the elimination rate e(Xs, as) by c ← c + c_adj, where c_adj := 2 (c̄ − c̲) adj denotes the cautiousness adjustment factor. The factor 2 (c̄ − c̲) is a scaling factor that simply regulates the extent of the cautiousness adjustment depending on the interval length c̄ − c̲. The more crucial factor in the formula is adj, which indicates the sign and magnitude of the cautiousness adjustment:

adj := (⌊|D|/2⌋ − ε) / |D| − e(Xs, as)

where ε ∈ (0, 1/2) is a constant which prevents the algorithm from getting stuck in a no-risk strategy for even |D|. E.g., given ε = 0 and even |D| (so that ⌊|D|/2⌋/|D| = 0.5), the elimination rate of a no-risk query is e(Xs, as) = 1/2, always resulting in adj = 0. The value of ε can be set to an arbitrary real number in this interval, e.g. ε := 1/4. If c + c_adj lies outside the user-defined cautiousness interval [c̲, c̄], it is set to c̲ if c + c_adj < c̲ and to c̄ if c + c_adj > c̄. A positive c_adj is a penalty telling the algorithm to get more cautious, whereas a negative c_adj is a bonus resulting in braver behavior of the algorithm.

The RIO algorithm, described in Algorithm 2, starts with the computation of minimal diagnoses.
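The cautiousness bookkeeping just defined (Definitions 5-7 and the adjustment rule) can be sketched as follows; the function names, argument conventions and the sample numbers are ours, not part of the paper:

```python
from math import floor

def caut(n_dp, n_dn, n_leading):
    """Cautiousness of a query (Definition 5), from its partition sizes."""
    return min(n_dp, n_dn) / n_leading

def elimination_rate(n_dp, n_dn, n_leading, answer):
    """Elimination rate (Definition 6): 'yes' discards D^N, 'no' discards D^P."""
    return (n_dn if answer == "yes" else n_dp) / n_leading

def update_cautiousness(c, c_lo, c_hi, n_leading, elim_rate, eps=0.25):
    """c <- c + 2(c_hi - c_lo) * adj, clamped to [c_lo, c_hi]: a low
    elimination rate raises c (penalty), a high one lowers it (bonus)."""
    adj = (floor(n_leading / 2) - eps) / n_leading - elim_rate
    return min(max(c + 2 * (c_hi - c_lo) * adj, c_lo), c_hi)

# |D| = 6 and query X1 of the running example (|D^P| = 2, |D^N| = 4):
print(caut(2, 4, 6))                      # 1/3
print(elimination_rate(2, 4, 6, "no"))    # 'no' eliminates D^P: 1/3
# |D| = 9, interval [0, 4/9]: a useless query (rate 0) pushes c to the cap,
# a maximally effective one (rate 4/9) makes the algorithm braver.
print(update_cautiousness(0.25, 0.0, 4/9, 9, 0.0))
print(update_cautiousness(0.25, 0.0, 4/9, 9, 4/9))
```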
The getDiagnoses function implements a combination of the hitting-set (HS-Tree) [11] and QuickXPlain [6] algorithms, as suggested in [12]. Using uniform-cost search, the algorithm extends the set of leading diagnoses D with a maximum number of most probable minimal diagnoses such that |D| ≤ n. Then the getProbabilities function calculates the fault probabilities p(Di) for each diagnosis Di of the set of leading diagnoses D according to [12]. In order to take into account all information gathered by querying an oracle so far, the algorithm adjusts the fault probabilities p(Di) as follows: p_adj(Di) = (1/2)^z p(Di), where z is the number of preceding queries Xk for which Di ∈ D^∅_k. Afterwards the probabilities p_adj(Di) are normalized. Note that z can be computed from P and N, which comprise all query answers. This way of updating probabilities is exactly in compliance with the Bayesian formula [12].

Based on the set of leading diagnoses D, generateQueries generates all queries according to Algorithm 1. getMinScoreQuery determines the best query Xsc ∈ X according to sc_ent, that is, Xsc = arg min_{Xk ∈ X} sc_ent(Xk). If Xsc is a non-high-risk query, i.e. c ≤ caut(Xsc) (determined by getQueryCautiousness), Xsc is selected. In this case, Xsc is the query with maximum information gain among all queries X and additionally guarantees the required elimination rate specified by c. Otherwise, getAlternativeQuery selects the query Xalt ∈ X (Xalt ≠ Xsc) which has minimal score sc_ent among all least cautious non-high-risk queries L_c. That is, Xalt = arg min_{Xk ∈ L_c} sc_ent(Xk) where L_c = {Xr ∈ X \ HR_c(X) | ∀Xt ∈ X \ HR_c(X) : caut(Xr) ≤ caut(Xt)}. If there is no such query Xalt ∈ X, then Xsc is selected. Given a positive answer of the oracle, the selected query Xs ∈ {Xsc, Xalt} is added to the set of positive test cases P; otherwise, it is added to the set of negative test cases N.
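The probability adjustment and the risk-aware selection step described above can be sketched as follows. The query representation (partition dicts), the helper names and the even distribution of p(D^∅) over the two answers are assumptions of this sketch, not the paper's implementation:

```python
from math import log2

def adjust_probabilities(p, d0_counts):
    """Discount each diagnosis by (1/2)^z, z = number of earlier queries
    whose D^0 contained it, then normalize (cf. the Bayesian update)."""
    adjusted = {d: p[d] * 0.5 ** d0_counts.get(d, 0) for d in p}
    total = sum(adjusted.values())
    return {d: v / total for d, v in adjusted.items()}

def sc_ent(q, p):
    """Entropy score of a query q = {'P': [...], 'N': [...], '0': [...]};
    p(D^0) is split evenly between the two possible answers (assumption)."""
    p_dp = sum(p[d] for d in q["P"])
    p_dn = sum(p[d] for d in q["N"])
    p_d0 = sum(p[d] for d in q["0"])
    score = p_d0 + 1.0
    for pa in (p_dp + p_d0 / 2, p_dn + p_d0 / 2):
        if pa > 0:
            score += pa * log2(pa)
    return score

def caut(q, n):
    return min(len(q["P"]), len(q["N"])) / n

def select_query(queries, p, c):
    """RIO's choice: the entropy-optimal query X_sc unless it is high-risk
    (caut < c); then the entropy-best among the least cautious
    non-high-risk queries, if any exist, else X_sc itself."""
    n = len(p)
    x_sc = min(queries, key=lambda q: sc_ent(q, p))
    if caut(x_sc, n) >= c:
        return x_sc
    safe = [q for q in queries if caut(q, n) >= c]
    if not safe:
        return x_sc
    min_caut = min(caut(q, n) for q in safe)
    return min((q for q in safe if caut(q, n) == min_caut),
               key=lambda q: sc_ent(q, p))

p = adjust_probabilities({"D1": 0.7, "D2": 0.1, "D3": 0.1, "D4": 0.1}, {})
q_brave = {"P": ["D1"], "N": ["D2", "D3", "D4"], "0": []}   # caut = 1/4
q_safe  = {"P": ["D1", "D2"], "N": ["D3", "D4"], "0": []}   # caut = 2/4
picked = select_query([q_brave, q_safe], p, c=0.4)
print(picked is q_safe)  # q_brave scores better but is high-risk for c = 0.4
```

With a lower cautiousness value, e.g. c = 0.2, the same call would return the braver, entropy-optimal query.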
In the last step of the main loop the algorithm updates the cautiousness value c (function updateCautiousness) as described above. Before the next query selection iteration starts, a stop condition test is performed. The algorithm evaluates whether the most probable diagnosis is at least σ% more likely than the second most probable diagnosis (aboveThreshold) or whether none of the leading diagnoses has been eliminated by the previous query, i.e. getEliminationRate returns zero for Xs. In case one of the stop conditions is fulfilled, the presently most likely diagnosis is returned (mostProbableDiag).

4 Evaluation

The main points we want to show in this evaluation are the following: On the one hand, independently of the specified meta information, RIO exhibits superior average behavior compared to the entropy-based method and split-in-half w.r.t. the amount of user interaction required. On the other hand, we want to demonstrate that RIO scales well and that the measured reaction time is well suited for an interactive debugging approach.

As data source for the evaluation we used problematic real-world ontologies produced by ontology matching systems.4 This has the following reasons: (1) Matching results often cause inconsistency and/or incoherency of ontologies. (2) The (fault) structure of different ontologies obtained through matching generally varies due to the different authors and matching systems involved in the genesis of these ontologies. (3) For the same reasons, it is hard to estimate the quality of fault probabilities, i.e. it is unclear which of the existing query selection strategies to choose for best performance. (4) Available reference mappings can be used as correct solutions of the debugging procedure.

Matching of two ontologies Oi and Oj is understood as detection of correspondences between matchable elements of these ontologies. An ontology matching operation determines an alignment Mij, which is a set of correspondences, i.e.
tuples of the form ⟨xi, xj, r, v⟩, where xi ∈ Q(Oi), xj ∈ Q(Oj), Q(O) is the set of matchable elements of an ontology O, r is a semantic relation and v ∈ [0, 1] is a confidence value. We call OiMj := Oi ∪ Mij ∪ Oj the aligned ontology for Oi and Oj. In our approach the elements of Q(O) are restricted to atomic concepts and roles, and r ∈ {⊑, ⊒, ≡} under the natural alignment semantics [8].

Example (continued): Imagine that our example ontology O evolved from matching two standalone ontologies O1 := {ax1, ax2} and O2 := {ax3, ax4}, resulting in the alignment M12 = {ax5, ax6}. If we recall the set of diagnoses for O consisting of all single axioms in O, we realize that the fault we are trying to find may be located either in O1 or in O2 or in M12. Existing approaches to alignment debugging usually consider only the produced alignment as problem source. Our approach, on the contrary, is designed to cope with the most general setting: any subset S ⊆ O1M2 of axioms of the aligned ontology can be analyzed for faults, whereas O1M2 \ S can be added to the background axioms B, if known to be correct.

4 Thanks to Christian Meilicke for the supply of the test cases used in the evaluation.

In this way, the search space for
diagnoses can be restricted elegantly depending on the prior knowledge about Dt, which can greatly reduce the complexity of the underlying diagnosis problem.

Algorithm 2: Risk Optimization Algorithm (RIO)
Input: diagnosis problem instance ⟨O, B, P, N⟩, fault probabilities of diagnoses DP, cautiousness C = (c, c̲, c̄), number of leading diagnoses n to be considered, acceptance threshold σ
Output: a diagnosis D
1 P ← ∅; N ← ∅; D ← ∅;
2 repeat
3   D ← getDiagnoses(D, n, O, B, P, N);
4   DP ← getProbabilities(DP, D, P, N);
5   X ← generateQueries(O, B, P, D);
6   Xs ← getMinScoreQuery(DP, X);
7   if getQueryCautiousness(Xs, D) < c then Xs ← getAlternativeQuery(c, X, DP, D);
8   if getAnswer(Xs) = yes then P ← P ∪ {Xs};
9   else N ← N ∪ {Xs};
10  c ← updateCautiousness(D, P, N, Xs, c, c̲, c̄);
11 until (aboveThreshold(DP, σ) ∨ eliminationRate(Xs) = 0);
12 return mostProbableDiag(D, DP);

In [13] it was shown that existing debugging approaches suffer from serious problems w.r.t. both scalability and correctness of results when tested on a dataset of incoherent aligned OWL ontologies. Since RIO is an interactive ontology debugging approach able to query and incorporate additional information into its computations, it can cope with cases unsolved in [13]. In order to provide evidence for this and to show the feasibility of RIO – simultaneously to the main goals of this evaluation – we decided to use a superset of the dataset5 used in [13] for our tests. Each incoherent aligned ontology OiMj in the dataset is the result of applying one of the ontology matching systems COMA++, Falcon-AO, HMatch or OWL-CTXmatch to a set of six ontologies Ont = {CRS, PCS, CMT, CONFTOOL, SIGKDD, EKAW} in the domain of conference organization. For a given pair of ontologies Oi ≠ Oj ∈ Ont, each system produced an alignment Mij. On the basis of a manually produced reference alignment Rij ⊆ Mij for ontologies Oi, Oj (cf. [9]), we were able to fix a target diagnosis Dt for each incoherent OiMj.
In cases where Rij suggested a non-minimal diagnosis, we defined Dt as the minimum cardinality diagnosis which was a subset of Mij \ Rij. In one single case, Rij proved to be incoherent because an obviously valid correspondence Reviewer1 ≡ reviewer2 turned out to be incorrect. We re-evaluated this ontology and specified a coherent Rij. Yet this makes evident that, in general, people are not capable of analyzing alignments without adequate tool support.

In our experiments we set the prior fault probabilities as follows: p(axk) := 0.001 for axk ∈ Oi ∪ Oj and p(axm) := 1 − vm for axm ∈ Mij, where vm is the confidence of the correspondence underlying axm. Note that this choice results in a significant bias towards diagnoses which include axioms from Mij. Based on these settings, in the first experiment (EXP-1), we simulated an interactive debugging session employing the split-in-half (SPL), entropy (ENT) and RIO algorithms, respectively, for each ontology OiMj. Throughout all experiments, we performed module extraction before each test run, which is a standard preprocessing method for ontology debugging approaches. All tests were executed on a Core i7 (3930K) 3.2 GHz machine with 32 GB RAM, running Ubuntu Server 11.04 and Java 6. The user-chosen parameters were set as follows: |D| := 9 (proved to be a good trade-off between computational complexity for query generation and approximation of all minimal diagnoses), σ := 85%, c := 0.25 and [c̲, c̄] := [cmin, cmax] = [0, 4/9]. For the tests we considered the most general setting, i.e. Dt ⊂ OiMj. So, we did not restrict the search for Dt to Mij only, simulating the case where the user has no idea whether any of the input ontologies Oi, Oj or the alignment Mij or a combination thereof is faulty. In each test run we measured the number of required queries until Dt was identified.
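The prior assignment used in EXP-1 is a one-liner over the two axiom sources; a sketch with hypothetical toy data (the axiom and correspondence names are invented for illustration):

```python
def prior_fault_probabilities(onto_axioms, alignment):
    """Priors as in EXP-1 (sketch): a small constant 0.001 for axioms of
    the input ontologies, 1 - confidence for alignment correspondences."""
    p = {ax: 0.001 for ax in onto_axioms}
    p.update({ax: round(1.0 - conf, 6) for ax, conf in alignment.items()})
    return p

# Two ontology axioms and two correspondences with confidences 0.9 and 0.65:
p = prior_fault_probabilities(["ax1", "ax2"], {"m1": 0.9, "m2": 0.65})
print(p)  # {'ax1': 0.001, 'ax2': 0.001, 'm1': 0.1, 'm2': 0.35}
```

As the text notes, high-confidence correspondences still receive a far larger prior than ontology axioms, biasing the search towards diagnoses within Mij.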
In order to simulate the case where the fault includes at least one axiom ax ∈ OiMj \ Mij, we implemented a second test session with an altered Dt. In this experiment (EXP-2), we precalculated a maximum of 30 most probable minimal diagnoses, and from these we selected the diagnosis with the highest number of axioms axk ∈ OiMj \ Mij as Dt in order to simulate more unsuitable meta information. All other settings were left unchanged. The queries generated in the tests were answered by an automatic oracle by means of the target ontology OiMj \ Dt. The average metrics for the set of aligned ontologies OiMj per matching system were as follows: 312 ≤ |OiMj| ≤ 377 and 19.1 ≤ |Mij| ≤ 28.4.

5 http://code.google.com/p/rmbd/downloads

              EXP-1   EXP-2   EXP-3   EXP-4
qSPL < qENT    11%     37%      0%     29%
qENT < qSPL    82%     56%    100%     71%
qSPL = qENT     7%      7%      0%      0%
qRIO < min      4%     26%     29%     71%
qRIO ≤ min     74%     74%    100%    100%

Fig. 1. (a) Percentage rates indicating which strategy performed best/better w.r.t. the required user interaction, i.e. number of queries. EXP-1 and EXP-2 involved 27, EXP-3 and EXP-4 seven debugging sessions each. qstr denotes the number of queries needed by strategy str and min is an abbreviation for min(qSPL, qENT). (b) Box-whisker plots presenting the distribution of overhead (qw − qb)/qb ∗ 100 (in %) per debugging session of the worse strategy qw := max(qSPL, qENT) compared to the better strategy qb := min(qSPL, qENT). Mean values are depicted by a cross.

In order to analyze the scalability of RIO, we used the set of ontologies from the ANATOMY track of the Ontology Alignment Evaluation Initiative6 (OAEI) 2011.5, which comprises two input ontologies O1 (Human, 11545 axioms) and O2 (Mouse, 4838 axioms). The size of the alignments generated by 12 different matching systems was between 1147 and 1461 correspondences. Note that the aligned ontologies output by five matching systems, i.e.
CODI, CSA, MaasMtch, MapEVO and Aroma, could not be analyzed in the experiments. This was due to a consistent output produced by CODI, and due to the reasoner not being able to find a model within acceptable time (2 hours) in the cases of CSA, MaasMtch, MapEVO and Aroma. Similar reasoning problems were also reported in [2]. Given the ontologies O1 and O2, the output M12 of a matching system, and the correct reference alignment R12, we first fixed Dt as follows: both ontologies O1 and O2 as well as the correctly extracted correspondences M12 ∩ R12 were placed in the background knowledge B. The incorrect correspondences M12 \ R12 were analyzed by the debugger. In this way, we identified a set of diagnoses, where each diagnosis is a subset of M12 \ R12. From this set of diagnoses, we randomly selected one diagnosis as Dt. Then we started the actual experiments: In EXP-3,7 a debugging session with parameter settings as in EXP-1 was executed in order to simulate reasonable prior fault probabilities. In EXP-4, we altered the settings by specifying p(axk) := 0.01 for axk ∈ O1 ∪ O2 and p(axm) := 0.001 for axm ∈ M12, which caused the target diagnosis, consisting solely of axioms in M12, to be assigned a relatively low prior fault probability. Results of both experimental sessions, ⟨EXP-1,EXP-2⟩ and ⟨EXP-3,EXP-4⟩, are summarized in Figure 2(a) and Figure 2(b), respectively. For the ontologies produced by each of the matching systems and for the different experimental scenarios, the figures show the (average) number of queries asked by RIO and the (average) differences to the number of queries needed by the per-session better and worse strategy of SPL and ENT, respectively. The results clearly illustrate that the average performance achieved by RIO was always substantially closer to the better than to the worse strategy.
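The fixing of Dt for the ANATOMY experiments described above, i.e. moving both input ontologies and the correct correspondences into the background knowledge and analyzing only the remainder, can be sketched as follows (illustrative Python; the set-based encoding and names are assumptions, not our actual implementation):

```python
def split_for_debugging(o1, o2, m12, r12):
    """Partition for EXP-3/EXP-4: correct parts form the background
    knowledge B; only the suspect correspondences are analyzed."""
    background = set(o1) | set(o2) | (set(m12) & set(r12))
    suspect = set(m12) - set(r12)  # each diagnosis is a subset of this
    return background, suspect
```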
In both EXP-1 and EXP-2, throughout 74% of the 27 debugging sessions, RIO worked as efficiently as the best strategy (Figure 1(a)). In more than 25% of the cases in EXP-2, RIO even outperformed both other strategies; in these cases, RIO could save more than 20% of user interaction on average compared to the best other strategy.

6 http://oaei.ontologymatching.org
7 For all details w.r.t. ⟨EXP-3,EXP-4⟩, see http://code.google.com/p/rmbd/wiki/OntologyAlignmentAnatomy

In one scenario involving OWL-CTXmatch in EXP-1, ENT took 31 and SPL 13 queries to finish, whereas RIO required only 6 queries, which amounts to an improvement of more than 80% and 53%, respectively. In ⟨EXP-3,EXP-4⟩, the savings achieved by RIO were even more substantial. RIO manifested behavior superior to both other strategies in 29% and 71% of the cases, respectively. No less remarkably, in 100% of the tests in EXP-3 and EXP-4, RIO was at least as efficient as the best other strategy. Table 1, which provides the average number of queries per strategy, demonstrates that, overall, RIO is the best choice in all experiments. Consequently, RIO is suitable both for good meta information as in EXP-1 and EXP-3, where Dt has a high probability, and for poor meta information as in EXP-2 and EXP-4, where Dt is a-priori less likely. Additionally, Table 1 reports the (average) overall debugging time, assuming that queries are answered instantaneously, as well as the reaction time, i.e. the average time between two successive queries. Also w.r.t. these aspects, RIO manifested good performance. Since the times consumed by any of the strategies in ⟨EXP-1,EXP-2⟩ are almost negligible, consider the more meaningful results obtained in ⟨EXP-3,EXP-4⟩. While the best reaction time in both experiments was achieved by SPL, we can clearly see that SPL was significantly inferior to both ENT and RIO concerning the required user interaction and the overall time.
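As a side note, the overhead measure of Figure 1(b) and the overall-time estimates discussed below can be reproduced from Table 1 in a few lines of Python (an illustrative sketch; the function names are ours, and the 30 seconds per query correspond to the half-a-minute reading/answering assumption used in the discussion):

```python
def overhead_percent(q_spl, q_ent):
    """Overhead (q_w - q_b) / q_b * 100 of the per-session worse
    strategy w.r.t. the better one, as plotted in Fig. 1(b)."""
    q_b, q_w = min(q_spl, q_ent), max(q_spl, q_ent)
    return (q_w - q_b) / q_b * 100.0

def total_time_s(debug_ms, q, secs_per_answer=30.0):
    """Overall session time in seconds if the user needs
    secs_per_answer seconds per query on average."""
    return debug_ms / 1000.0 + q * secs_per_answer

# Table 1, EXP-4 averages: RIO vs. ENT and SPL
rio = total_time_s(66895, 8.14)
ent = total_time_s(74463, 11.86)
spl = total_time_s(98647, 18.29)
saving_vs_ent = (ent - rio) / ent * 100  # roughly the 25% stated in the text
saving_vs_spl = (spl - rio) / spl * 100  # roughly the 50% stated in the text
```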
RIO revealed the best debugging time in EXP-4, and needed only 2.2% more time than the best strategy (ENT) in EXP-3. However, if we assume that the user is capable of reading and answering a query in, e.g., half a minute on average, which is already quite fast, then the overall time savings of RIO compared to ENT in EXP-3 would already amount to 5%. Doing the same thought experiment for EXP-4, using RIO instead of ENT and SPL would save 25% and 50% of the debugging time on average, respectively. All in all, the measured times confirm that RIO is well suited as an interactive debugging method. For the SPL and ENT strategies, the difference w.r.t. the number of queries per test run between the better and the worse strategy was substantial, with a maximum of 2300% in EXP-4 and averages of 190% to 1145% throughout all four experiments, measured on the basis of the better strategy (Figure 1(b)). Moreover, the results show that the different quality of the prior fault probabilities in {EXP-1,EXP-3} compared to {EXP-2,EXP-4} clearly affected the performance of the ENT and SPL strategies (see the first two rows in Figure 1(a)). This perfectly motivates the application of RIO.

[Figure 2: bar charts of the average number of queries q per matching system (HMatch, Falcon-AO, OWL-Ctxmatch, COMA++); panel (a) covers ⟨EXP-1,EXP-2⟩, panel (b) ⟨EXP-3,EXP-4⟩.]

Fig. 2. The bars show the avg. number of queries (q) needed by RIO, grouped by matching tools. The distance from the bar to the lower (upper) end of the whisker indicates the avg. difference of RIO to the queries needed by the per-session better (worse) strategy of SPL and ENT, respectively.

      |       EXP-1       |       EXP-2       |        EXP-3        |       EXP-4
      | debug react    q  | debug react    q  |  debug  react    q  | debug  react     q
ENT   |  1860   262  3.67 |  1423   204  5.26 |  60928  12367  5.86 | 74463   5629 11.86
SPL   |  1427   159  5.70 |  1237   148  5.44 | 104910   4786 19.43 | 98647   4781 18.29
RIO   |  1592   286  3.00 |  1749   245  4.37 |  62289  12825  5.43 | 66895   8327  8.14

Table 1.
Average time (ms) for the entire debugging session (debug), average time (ms) between two successive queries (react), and average number of queries (q) required by each strategy.

5 Conclusion

We have shown problems of state-of-the-art interactive ontology debugging strategies w.r.t. the usage of unreliable meta information. To tackle this issue, we proposed a learning strategy which combines the benefits of existing approaches, i.e. high potential and low risk. Depending on the performance of the diagnosis discrimination actions, the trust in the a-priori information is adapted. Tested under various conditions, our algorithm revealed an average performance superior to that of two common approaches in the field w.r.t. the required user interaction. In our evaluation we showed the utility of our approach in the important area of ontology matching, its scalability, and its adequate reaction time allowing for continuous interactivity.

References

1. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P. (eds.): The Description Logic Handbook: Theory, Implementation, Applications. Cambridge University Press (2003)
2. Ferrara, A., van Hage, W.R., Hollink, L., Nikolov, A., Shvaiko, P.: Final Results of the Ontology Alignment Evaluation Initiative 2011 (2011)
3. Friedrich, G., Shchekotykhin, K.: A General Diagnosis Method for Ontologies. In: Gil, Y., Motta, E., Benjamins, R., Musen, M. (eds.) The Semantic Web - ISWC 2005, 4th International Semantic Web Conference. pp. 232–246. Springer (2005)
4. Horridge, M., Parsia, B., Sattler, U.: Laconic and Precise Justifications in OWL. In: Proceedings of the 7th International Semantic Web Conference (ISWC 2008). LNCS 5318, pp. 323–338. Springer (2008)
5. Jiménez-Ruiz, E., Grau, B.C.: LogMap: Logic-based and Scalable Ontology Matching. In: The Semantic Web - ISWC 2011. pp. 273–288. Springer (2011)
6. Junker, U.: QUICKXPLAIN: Preferred Explanations and Relaxations for Over-Constrained Problems. In: Proceedings of the 19th National Conference on AI, 16th Conference on Innovative Applications of AI. vol. 3, pp. 167–172. AAAI Press / The MIT Press (2004)
7. de Kleer, J., Williams, B.C.: Diagnosing Multiple Faults. Artificial Intelligence 32(1), 97–130 (1987)
8. Meilicke, C., Stuckenschmidt, H.: An Efficient Method for Computing Alignment Diagnoses. In: Proceedings of the 3rd International Conference on Web Reasoning and Rule Systems. pp. 182–196. Springer (2009)
9. Meilicke, C., Stuckenschmidt, H., Tamilin, A.: Reasoning Support for Mapping Revision. Journal of Logic and Computation 19(5), 807–829 (2008)
10. Noessner, J., Niepert, M., Meilicke, C., Stuckenschmidt, H.: Leveraging Terminological Structure for Object Reconciliation. In: The Semantic Web: Research and Applications. pp. 334–348. Springer (2010)
11. Reiter, R.: A Theory of Diagnosis from First Principles. Artificial Intelligence 32(1), 57–95 (1987)
12. Shchekotykhin, K., Friedrich, G., Fleiss, P., Rodler, P.: Interactive Ontology Debugging: Two Query Strategies for Efficient Fault Localization. Web Semantics: Science, Services and Agents on the World Wide Web 12-13, 88–103 (2012)
13. Stuckenschmidt, H.: Debugging OWL Ontologies - A Reality Check. In: Proceedings of the 6th International Workshop on Evaluation of Ontology-based Tools and the Semantic Web Service Challenge (EON). pp. 1–12. Tenerife, Spain (2008)