RIO: Minimizing User Interaction in Ontology Debugging

Patrick Rodler, Kostyantyn Shchekotykhin, Philipp Fleiss, and Gerhard Friedrich
Alpen-Adria Universität Klagenfurt, 9020 Klagenfurt, Austria
firstname.lastname@aau.at

Abstract. Interactive ontology debugging incorporates a user who answers queries about entailments of their intended ontology. In order to minimize the amount of user interaction in a debugging session, a user must choose an appropriate query selection strategy. However, the choice of an unsuitable strategy may result in tremendous overhead in terms of time and cost. We present a learning method for query selection which unites the advantages of existing approaches while overcoming their flaws. Our tests on a large set of real-world ontologies show the utility of our approach, its scalability, and reaction times adequate for continuous interactivity.

Motivation and Concept: Efficient ontology debugging is a cornerstone for many activities in the context of the Semantic Web, especially when automatic tools produce (parts of) ontologies, as in the field of ontology matching. Ontology matching aims at generating a set of semantic links, called an alignment, between elements of two standalone ontologies describing related domains. The two ontologies together with the produced alignment, called the aligned ontology, may exhibit a very complex fault structure, since (1) many links between the single ontologies are added at once and (2) the actual fault may be located in the produced alignment and/or in one or both of the single ontologies, e.g. if a correct link between two concepts "activates" a source of inconsistency in one of the single ontologies. This makes evident that adequate tool assistance for debugging such ontologies is indispensable.
Ontology debugging deals with the following problem: given an ontology O which does not meet postulated requirements R (throughout this work, R = {consistency, coherency}), the task is to find a subset of the axioms in O, called a diagnosis, that needs to be altered or eliminated from the ontology in order to meet the given requirements. Generally, there are many alternative diagnoses for one and the same faulty ontology O. The problem is then to figure out the single diagnosis, called the target diagnosis Dt, that enables the formulation of the target ontology Ot featuring the user-intended semantics in terms of entailments and non-entailments. The target ontology can be understood as O minus the axioms of Dt plus additional axioms EX_Dt which can be added in order to regain desired entailments that might have been eliminated together with the axioms in Dt.

In interactive ontology debugging we assume a user, e.g. the author of the faulty ontology or a domain expert, who interacts with an ontology debugging system by answering queries about entailments of the desired target ontology Ot. Roughly speaking, each query is a set of axioms, and the user is asked whether the conjunction of these axioms is entailed by Ot. Every positively (negatively) answered query constitutes a positive/entailed (negative/non-entailed) test case fulfilled by Ot. Test cases can be seen as constraints that Ot must satisfy and are therefore used to gradually reduce the search space of valid diagnoses. Simply put, the overall procedure consists of (1) computing a predefined fixed number of diagnoses (the set of leading diagnoses D, usually |D| ≈ 10) as an approximation of the set of all diagnoses, (2) gathering additional information by querying the user, i.e. adding a positive or negative test case, and (3) incorporating this information to cut irrelevant areas off the search space, i.e. eliminating diagnoses that do not comply with the newly specified test case.
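The three-step procedure above can be sketched as a plain loop. This is an illustrative skeleton only; compute_leading, select_query, and oracle are hypothetical interfaces we introduce for exposition, not part of any concrete debugger API.

```python
def interactive_debug(compute_leading, select_query, oracle):
    """Skeleton of the interactive debugging loop (illustrative sketch).

    compute_leading(pos, neg) -- returns the current leading diagnoses D,
        already pruned by the positive/negative test cases collected so far
    select_query(D)           -- picks the next query (e.g. via SPL, ENT, RIO)
    oracle(query)             -- the user: True iff the query is entailed by O_t
    """
    pos, neg = [], []                          # positive / negative test cases
    while True:
        D = compute_leading(pos, neg)          # (1) leading diagnoses, |D| ~ 10
        if len(D) == 1:
            return D[0]                        # the target diagnosis D_t
        q = select_query(D)                    # (2) query the user
        (pos if oracle(q) else neg).append(q)  # (3) new test case prunes D
```

The loop terminates once the test cases collected so far rule out all but one leading diagnosis.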
This loop is continued until the search space is reduced to a single (target) diagnosis Dt. The goal is to achieve this with a minimal number of queries to the user.

The best currently known interactive debugging systems pursue active learning strategies for query generation, exploiting meta information in terms of fault probabilities of the user who formulates the ontology. Such a system is described in [1], where fault probabilities are used to calculate, for each diagnosis, the probability of being the target diagnosis. At each step, the query is selected which minimizes the expected entropy of the set of leading diagnoses D after the query is answered. This means that the expected uncertainty is minimized and the expected information gain is maximized. This entropy-based strategy (ENT) can speed up the debugging procedure if probabilities are specified appropriately, but it can also have a substantial negative impact on performance in case of unreasonable probabilities. The problem is that the quality of the probabilities can only be assessed a posteriori. Consequently, as long as the actual fault is unknown, there is always some risk of suboptimal query selection.

As an alternative, one might prefer to rely on an approach with constant performance which pursues a no-risk strategy without taking into account any meta information. One such strategy is split-in-half (SPL) [1], which selects the query that eliminates half of the leading diagnoses independently of the answer to the query. In this case, however, possibly well-chosen fault probabilities cannot be exploited, resulting again in inefficient debugging actions. To sum up, the user may choose between a strategy with high potential and high risk and a strategy with no risk and no potential. Therefore, we introduce a method with high potential and low risk, which can be seen as a hybrid risk optimization method (RIO) exploiting the positive aspects of both ENT and SPL.
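To make the two baseline strategies concrete, the following sketch (our illustration, not the implementation from [1]) scores a candidate query by the partition it induces on D: split[0] holds the diagnoses that survive a "yes" answer, split[1] those that survive a "no". As a simplification we assume every leading diagnosis predicts one of the two answers.

```python
import math

def entropy(ps):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def ent_score(split, probs):
    """ENT: expected entropy of the diagnosis probabilities after the query is
    answered (lower is better). split = (alive_on_yes, alive_on_no) partitions
    the leading diagnoses D; probs maps each diagnosis in D to its fault
    probability (summing to 1 over D)."""
    score = 0.0
    for alive in split:
        p_answer = sum(probs[d] for d in alive)  # probability of this answer
        if p_answer > 0:
            posterior = [probs[d] / p_answer for d in alive]  # Bayes update
            score += p_answer * entropy(posterior)
    return score

def spl_score(split):
    """SPL: imbalance of the split (lower is better); 0 means that half of the
    leading diagnoses are eliminated regardless of the user's answer."""
    return abs(len(split[0]) - len(split[1]))
```

For example, with probs = {1: 0.7, 2: 0.1, 3: 0.1, 4: 0.1}, ENT prefers the unbalanced query ([1], [2, 3, 4]) that isolates the most probable diagnosis, while SPL prefers the balanced query ([1, 2], [3, 4]); this is exactly the divergence between the two strategies discussed above.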
On the one hand, our method takes advantage of the given probabilities as long as good performance is achieved. On the other hand, it gradually becomes more independent of meta information if suboptimal behavior is measured. This is accomplished by constantly adapting a reinforcement learning parameter c ∈ [0, 0.5], which can be seen as the minimal postulated "cautiousness" of the next selected query. The cautiousness of a query is its worst-case elimination rate w.r.t. the set of leading diagnoses D. E.g., if |D| = 10 and a query Q1 eliminates 1 (9) leading diagnoses for a positive (negative) answer, the cautiousness of Q1 is 1/10, whereas a query Q2 with 5 (5) has an elimination rate, and thus a cautiousness, of 5/10. W.r.t., e.g., c = 0.3, Q1 would be a high-risk query since it eliminates less than 0.3 * 100% of the leading diagnoses in the worst case. Thus, it would be dismissed by RIO as a candidate for the next query. By contrast, Q2 is a non-high-risk query as it eliminates more diagnoses than required by c anyway. Actually, Q2 is even a no-risk query because it eliminates 50% of the diagnoses in D in any case. Given a query Q3 with 7 (3), we call Q3 more cautious than Q1 and less cautious than Q2.

More concretely, RIO works as follows: select the same query QENT as ENT would select if the cautiousness of QENT is greater than or equal to c. Otherwise, select the query with the best entropy measure among all (if more than one) least cautious non-high-risk queries. In the (rare) situation that no such query exists, select QENT. E.g., let the current c = 0.3 and QENT = Q1; then RIO would select Q3, since it has cautiousness 0.3 = c and is thus the only least cautious non-high-risk query. After each answered query, the new information is taken into account by updating the diagnosis probabilities according to Bayes' rule [1].
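The selection rule, together with the subsequent cautiousness update c ← c + a, can be sketched as follows. This is our own minimal reading of the rule, not the authors' implementation: each query is represented as a triple (elim_pos, elim_neg, entropy_score), and the proportionality factor k in the update as well as the clamping of c to [0, 0.5] are assumptions.

```python
def cautiousness(q, n):
    """Worst-case elimination rate of query q over |D| = n leading diagnoses;
    q = (elim_pos, elim_neg, entropy_score)."""
    return min(q[0], q[1]) / n

def rio_select(queries, q_ent, c, n):
    """RIO query selection (sketch): keep ENT's choice q_ent if it is cautious
    enough; otherwise take the best-entropy (lowest-score) query among the
    least cautious non-high-risk queries; fall back to q_ent if none exists."""
    if cautiousness(q_ent, n) >= c:
        return q_ent
    non_high_risk = [q for q in queries if cautiousness(q, n) >= c]
    if not non_high_risk:
        return q_ent
    least = min(cautiousness(q, n) for q in non_high_risk)
    return min((q for q in non_high_risk if cautiousness(q, n) == least),
               key=lambda q: q[2])  # best entropy among least cautious queries

def update_cautiousness(c, elim_rate, k=0.25):
    """c <- c + a with a proportional to (0.5 - achieved elimination rate):
    eliminating more than half of D yields a bonus (a < 0), otherwise a
    penalty. The factor k = 0.25 is an assumed value, not taken from the paper."""
    return min(max(c + k * (0.5 - elim_rate), 0.0), 0.5)
```

With the running example (|D| = 10, c = 0.3, Q1 = (1, 9, ·), Q2 = (5, 5, ·), Q3 = (7, 3, ·)): if ENT proposes Q1, RIO dismisses it (cautiousness 0.1 < 0.3) and selects Q3, the only least cautious non-high-risk query; if ENT proposes Q2, RIO keeps it.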
Additionally, the cautiousness parameter is adjusted as c ← c + a, where a is proportional to 0.5 minus the elimination rate actually achieved by the current query. So, for an elimination of more than half of the leading diagnoses, RIO receives a bonus (a < 0), allowing it to take more risk in the next iteration. Otherwise, a penalty is imposed, implying more cautious behavior in successive iterations. The following evaluation will demonstrate that, independently of the quality of the specified meta information, RIO exhibits superior average performance compared to ENT and SPL w.r.t. the amount of user interaction required. Furthermore, the experiments will show that RIO scales well and that the measured reaction time is well suited for an interactive debugging approach.

             EXP-1   EXP-2   EXP-3   EXP-4
qSPL < qENT    11%     37%      0%     29%
qENT < qSPL    81%     56%    100%     71%
qSPL = qENT     7%      7%      0%      0%
qRIO < min      4%     26%     29%     71%
qRIO ≤ min     74%     74%    100%    100%

Fig. 1. (a) Percentage rates (table above) indicating which strategy performed better w.r.t. the number of queries. qstr denotes the number of queries needed by strategy str, and min abbreviates min(qSPL, qENT). (b) Box-whisker plots (omitted) presenting the distribution of the overhead (qw - qb)/qb * 100 (in %) per debugging session of the worse strategy qw := max(qSPL, qENT) compared to the better strategy qb := min(qSPL, qENT).

Fig. 2. (Bar charts omitted; panels (a) EXP-1/EXP-2 and (b) EXP-3/EXP-4, grouped by the matching tools HMatch, Falcon-AO, OWL-Ctxmatch, and COMA++.) The bars show the avg. number of queries (q) needed by RIO, grouped by matching tool. The distance from a bar to the lower (upper) end of its whisker indicates the avg. difference between q and the number of queries needed by the per-session better (worse) strategy, qb (qw). Notation as in Figure 1.

Evaluation: We performed four experiments EXP-i (i = 1, . . .
, 4), in which we applied RIO to a set of incoherent ontologies produced by automatic ontology matchers (for details and further results, see http://code.google.com/p/rmbd/wiki/OntologyAlignmentAnatomy). As data source for EXP-1 and EXP-2 we used a superset of the dataset used in [2] (available at http://code.google.com/p/rmbd/downloads), where it was shown that existing debugging approaches suffer from serious problems w.r.t. both scalability and correctness of results when tested on this dataset. As an interactive approach able to query and incorporate additional information into its computations, RIO can cope with cases left unsolved in [2]. For the scalability tests in EXP-3 and EXP-4, we used the set of ontologies from the ANATOMY track of the Ontology Alignment Evaluation Initiative (OAEI) 2011.5, which comprises two input ontologies, O1 (Human, 11545 axioms) and O2 (Mouse, 4838 axioms). The faulty aligned ontologies had up to 17844 axioms. The available reference alignments made it possible to fix a target diagnosis Dt for each incoherent ontology. Throughout all experiments, unlike state-of-the-art alignment debuggers, we considered the most general problem, where the search for the target diagnosis is not restricted to the alignment. In each test run we measured the number of queries required until Dt was identified. All tests were executed on a Core i7 (3930K) at 3.2 GHz with 32 GB RAM, running Ubuntu Server 11.04 and Java 6. In EXP-1/EXP-3 the fault probabilities were chosen reasonably (good case), whereas in EXP-2/EXP-4 they were specified in such a way that Dt became very improbable (bad case). Queries were answered by an automatic oracle by means of the target ontology obtained through Dt. The results of ⟨EXP-1, EXP-2⟩ and ⟨EXP-3, EXP-4⟩ are summarized in Figure 2(a) and Figure 2(b), respectively. The results illustrate clearly that the avg. performance achieved by RIO was always substantially closer to the better than to the worse strategy.
In both EXP-1 and EXP-2, in 74% of the 27 debugging sessions RIO worked as efficiently as the best strategy (Figure 1(a)). In more than 25% of the cases in EXP-2, RIO even outperformed both other strategies; in these cases, RIO saved more than 20% of user interaction on average compared to the best other strategy. In one scenario in EXP-1, it took ENT 31 and SPL 13 queries to finish, whereas RIO required only 6 queries (an improvement of more than 80% and 53%, respectively). In ⟨EXP-3, EXP-4⟩, the savings achieved by RIO were even more substantial (superior behavior to both other strategies in 29% and 71% of the cases, respectively). No less remarkably, in 100% of the tests in EXP-3 and EXP-4, RIO was at least as efficient as the best other strategy. Concerning the average number of queries per strategy, RIO is the best choice in all experiments. Consequently, RIO is suitable for both good meta information (EXP-1/EXP-3) and poor meta information (EXP-2/EXP-4). Moreover, assuming a user capable of reading and answering a query in, e.g., half a minute on average, RIO shows the best performance w.r.t. overall debugging time, with savings of up to 50% compared to ENT/SPL. The reaction time of RIO, i.e. the avg. time between two successive queries, was always below 12.9 s. For the SPL and ENT strategies, the difference in the number of queries per test run between the better and the worse strategy was enormous, with a maximum of 2300% in EXP-4 and averages of 190% to 1145% throughout the four experiments (Figure 1(b)). Moreover, the results show that the differing quality of the probabilities in {EXP-1, EXP-3} versus {EXP-2, EXP-4} clearly affected the performance of the ENT and SPL strategies (Figure 1(a)). This perfectly motivates the application of RIO.

References

1. Shchekotykhin, K., Friedrich, G., Fleiss, P., Rodler, P.: Interactive ontology debugging: two query strategies for efficient fault localization.
Web Semantics: Science, Services and Agents on the World Wide Web 12-13, 88–103 (2012)
2. Stuckenschmidt, H.: Debugging OWL Ontologies - A Reality Check. In: Proceedings of the 6th International Workshop on Evaluation of Ontology-based Tools and the Semantic Web Service Challenge (EON), pp. 1–12, Tenerife, Spain (2008)