=Paper=
{{Paper
|id=Vol-1387/paper3
|storemode=property
|title=Verifying Reasoner Correctness - A Justification Based Method
|pdfUrl=https://ceur-ws.org/Vol-1387/paper_3.pdf
|volume=Vol-1387
|dblpUrl=https://dblp.org/rec/conf/ore/LeeMSP15
}}
==Verifying Reasoner Correctness - A Justification Based Method==
Verifying Reasoner Correctness - A Justification Based Method

Michael Lee, Nico Matentzoglu, Bijan Parsia, and Uli Sattler

The University of Manchester, Oxford Road, Manchester, M13 9PL, UK
{michael.lee-5,nicolas.matentzoglu,bijan.parsia,uli.sattler}@manchester.ac.uk

Abstract. DL reasoners are complex pieces of software that work on even more complex input, which makes manual verification difficult. A single ontology can have hundreds or thousands of classes, and thus its classification involves an unsurveyable number of subsumption tests. We propose a new method for debugging classification across multiple reasoners which employs justifications, generated from the set of entailments that reasoners disagree upon, to determine the cause of the disagreement.

Keywords: OWL, reasoning, debugging, justifications

1 Introduction

One of the major advantages of description logic reasoning is the existence of sound and complete decision procedures for the known fragments that underpin OWL 2 and its various profiles. However, implementing these procedures can be very difficult due to inter-related optimisations and language-specific limitations. It is very likely that no implementation is actually correct for all legal inputs. We have found that, in practice, reasoners do disagree on particular entailments in the inferred class hierarchy. The complexity of the implementations makes a formal verification of correctness, or even the generation of a non-arbitrary set of automated unit tests, a near impossibility. Thus, it is difficult to resolve disputes.

Majority voting (MV) is a resolution method used in reasoner competitions [4]. When a disagreement occurs over the inferred class hierarchy, the verdict of the majority of reasoners is taken as truth. In case of a tie, the correct reasoner is selected at random. This method has the key problem that it is not inherently sensitive to the truth of the matter. We present a novel semi-automated method to determine reasoner correctness and narrow down potential causes of disagreement amongst a set of dissenting reasoners in an efficient manner. We have evaluated our method using a corpus of BioPortal ontologies.

2 The Method

Throughout, we denote by R a description logic reasoner and by E(O, R) the set of atomic subsumptions found by R (the class hierarchy). J is a justification for O ⊨ η if J ⊆ O, J ⊨ η and there is no J′ ⊂ J such that J′ ⊨ η. We call a tuple ⟨O, η, J, Rjust, R⟩ a case, where Rjust is the reasoner that generated the justification J for the entailment η and R is the reasoner that verified it. Aside from these notions, we assume a basic familiarity with description logics, OWL and reasoners (see [1]).

The method: First, for a given ontology O, we ask reasoners R1, ..., Rm, denoted R, to compute E(O, Ri) for all Ri ∈ R. From these, we compute the set of disagreed-upon entailments, denoted D, by taking the union of all inferred class hierarchies and removing their intersection: D = ⋃ E(O, Ri) \ ⋂ E(O, Ri).

Fix some n ∈ ℕ as the upper bound on the number of generated justifications. For each entailment η ∈ D and every Ri ∈ R, we generate a sequence of up to n justifications J1, J2, ..., Jn. Note that we attempt to generate justifications with all reasoners, including the ones for which η ∉ E(O, Ri). For each justification Ji for η and each reasoner Ri ∈ R (for reliability, we let reasoners verify their own justifications), we check whether η follows directly from the given justification (justification verification).
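As an illustration, the following sketch shows how the inferred hierarchies E(O, Ri) and the disagreement set D could be computed with the OWL API (3.x style, matching the version used in Section 3). The class and method names are ours, not part of the paper's implementation, and the normalisation of owl:Thing, owl:Nothing and equivalent classes mentioned in Section 3 is omitted.

```java
// Sketch: compute E(O, Ri) per reasoner and D = union minus intersection.
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.*;
import java.util.*;

class DisagreementFinder {

    /** E(O, R): the atomic subsumptions entailed by reasoner R over O. */
    static Set<OWLSubClassOfAxiom> classHierarchy(OWLOntology o, OWLReasoner r,
            OWLDataFactory df) {
        r.precomputeInferences(InferenceType.CLASS_HIERARCHY);
        Set<OWLSubClassOfAxiom> entailed = new HashSet<>();
        for (OWLClass sub : o.getClassesInSignature()) {
            // all (not only direct) named superclasses of sub
            for (OWLClass sup : r.getSuperClasses(sub, false).getFlattened()) {
                entailed.add(df.getOWLSubClassOfAxiom(sub, sup));
            }
        }
        return entailed;
    }

    /** D: union of all inferred hierarchies minus their intersection. */
    static Set<OWLSubClassOfAxiom> disagreements(
            List<Set<OWLSubClassOfAxiom>> hierarchies) {
        Set<OWLSubClassOfAxiom> union = new HashSet<>();
        Set<OWLSubClassOfAxiom> intersection = new HashSet<>(hierarchies.get(0));
        for (Set<OWLSubClassOfAxiom> h : hierarchies) {
            union.addAll(h);
            intersection.retainAll(h);
        }
        union.removeAll(intersection);
        return union;  // the entailments the reasoners disagree upon
    }
}
```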
We then classify all cases according to one of four categories. (1) If Ri infers Ji ⊨ η and also initially considered η ∈ E(O, Ri), then we classify this behaviour as consistent yes. (2) If Ri infers Ji ⊨ η but did not compute η as part of the generated class hierarchy, then we consider Ri to have a simple bug. This could happen because of a bug in the classification traversal algorithm or because of faulty caching. (3) If Ri infers Ji ⊭ η and η ∉ E(O, Ri), then we class these as consistent no. We assume that either Ri or the reasoner that generated the justification contains a bug. These cases cannot be decided automatically, and we pass the justification to an expert. If the expert determines Ji to be a justification for η, then Ri is considered buggy; otherwise the justification-generating reasoner is. (4) If Ri infers Ji ⊭ η but initially returned η ∈ E(O, Ri), then we consider this a serious error and class it as a possible bug, as we are uncertain where the problem has occurred. Such cases also require expert review. In our experiment, this case did not occur.

3 Experimental Design

Corpus, Reasoners and Machines: For the experiment, the OWL API (v. 3.5.0) implementations [6] of four state-of-the-art reasoners were used: FaCT++ 1.6.3 [9], JFact 1.2.3, Pellet 2.3.1 [7] and HermiT 1.3.8 [3]. Aside from the fact that these are popular, FaCT++ and JFact were also picked because of their similarity, in an attempt to see whether that similarity would produce any change to the MV and to the results of the method. Moreover, the number of reasoners was set to 4 to provide enough variation between reasoners, but also to include the possibility of inducing a draw of two teams between them. Our corpus is a snapshot of 339 BioPortal ontologies (January 2015). All experiments were performed on a desktop computer with a 4-core i5-3479 CPU at 3.20 GHz running 64-bit Ubuntu 14.04.

Identifying potential ontologies: From our corpus, we filtered out those ontologies that were merely RDFS or AL and those that do not fall under OWL 2 DL. For the remaining 240, we computed the inferred class hierarchy (1 hour timeout) with all four reasoners. We excluded a further 80 ontologies for which there was a timeout. The inferred hierarchies were normalised using default techniques for dealing with owl:Nothing, owl:Thing, atomic subsumptions and equivalent classes. For the 160 ontologies that remained, the reasoners agreed in 155 cases and disagreed in 5.

Identifying and classifying problem cases: For the remaining 5 ontologies, we generate and verify justifications according to the method described above. We used the OWL Explanation Framework by Horridge [5] (https://github.com/matthewhorridge/owlexplanation) to generate explanations for each η ∈ D. For performance reasons, we generate only one justification per reasoner per entailment. Each explanation is stored in .owl format, allowing us to reload it, check it against the reasoners and, if needed, perform direct evaluation. For direct evaluation cases, we generate a human-readable version of the justification.
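To make this pipeline step concrete, the following sketch pairs justification generation with case classification. It assumes the owlexplanation API linked above (ExplanationManager, ExplanationGenerator; the signatures are our reading of that library, not guaranteed); the Verdict enum and all method names are ours, not part of the paper's tooling.

```java
// Sketch: generate justifications for an entailment and classify the
// resulting case into the four categories of Section 2.
import org.semanticweb.owl.explanation.api.Explanation;
import org.semanticweb.owl.explanation.api.ExplanationGenerator;
import org.semanticweb.owl.explanation.api.ExplanationGeneratorFactory;
import org.semanticweb.owl.explanation.api.ExplanationManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import org.semanticweb.owlapi.reasoner.OWLReasonerFactory;
import java.util.Set;

class CaseClassifier {

    /** The four categories from Section 2. */
    enum Verdict { CONSISTENT_YES, SIMPLE_BUG, CONSISTENT_NO, POSSIBLE_BUG }

    /** Generate up to n justifications for eta over O with one reasoner. */
    static Set<Explanation<OWLAxiom>> justifications(OWLReasonerFactory rf,
            OWLOntology o, OWLAxiom eta, int n) {
        ExplanationGeneratorFactory<OWLAxiom> egf =
                ExplanationManager.createExplanationGeneratorFactory(rf);
        ExplanationGenerator<OWLAxiom> gen = egf.createExplanationGenerator(o);
        return gen.getExplanations(eta, n);
    }

    /** Classify one case: does R verify J |= eta, and was eta in E(O, R)? */
    static Verdict classify(OWLReasonerFactory rf, OWLOntologyManager m,
            Explanation<OWLAxiom> j, boolean etaInHierarchy) throws Exception {
        OWLOntology justOnt = m.createOntology(j.getAxioms());
        OWLReasoner r = rf.createReasoner(justOnt);
        boolean verified = r.isEntailed(j.getEntailment());
        r.dispose();
        m.removeOntology(justOnt);
        if (verified) {
            return etaInHierarchy ? Verdict.CONSISTENT_YES : Verdict.SIMPLE_BUG;
        }
        return etaInHierarchy ? Verdict.POSSIBLE_BUG : Verdict.CONSISTENT_NO;
    }
}
```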
Testing hypotheses and generating patches: We test four hypotheses that emerged from initial observations and that we describe in more depth later: (1) the reasoner swallows asserted axioms, (2) the reasoner does not correctly implement the data type map, (3) interdependent justifications can be reduced to one or more root justifications of an entailment that can be fixed by asserting the entailment, and (4) we can use the intersection of explanations to generate "patches" that work around a large number of bugs and point to the erroneous behaviour.

4 Results

The 5 ontologies in our in-depth experiment were the Biological Collections Ontology (BCO), the Gene Regulation Ontology (GRO), the Sysmo Jerm Ontology of Systems Biology for Micro Organisms (JERM), the Clusters of Orthologous Groups Cog Analysis Ontology (CAO) and the Cell Culture Ontology (CCONT). Across these 5 ontologies there was one instance of total disagreement on the class hierarchy and 4 partial majorities. With respect to the total disagreement, it was found that 3 out of the 4 reasoners verified a justification for an entailment missing from their class hierarchy. HermiT, at least for our small set, did not exhibit any buggy behaviour.

Table 1 shows, for each ontology O and reasoner R:

– Was the reasoner part of the majority (MAJ) according to the MV method?
– Did total disagreement (tie on class hierarchy level) occur? (DIS)
– If there was an unclear entailment in D (possible bug/consistent no) involving R, was the reasoner generating the justification (TIE (J)) or verifying it (TIE (R))?
– Were there one or more cases for R which implicated a bug (BUG)?
– Does the verdict differ from simple majority voting (diff2MV)?

Table 1. Reasoner verdicts

O      R       MAJ  DIS  TIE (J)  TIE (R)  BUG  diff2MV
BCO    jfact   0    0    1        0        0    diff
BCO    fact    1    0    1        0        0    same
BCO    pellet  0    0    0        1        0    diff
BCO    hermit  1    0    1        0        0    same
GRO    jfact   0    0    1        0        1    same
GRO    fact    0    0    1        0        1    same
GRO    pellet  1    0    0        1        0    same
GRO    hermit  1    0    0        1        0    same
JERM   jfact   1    0    0        0        0    same
JERM   fact    0    0    0        0        1    same
JERM   pellet  1    0    0        0        0    same
JERM   hermit  1    0    0        0        0    same
CAO    jfact   0    1    0        0        1    same
CAO    fact    0    1    0        0        1    same
CAO    pellet  0    1    0        0        1    same
CAO    hermit  0    1    0        0        0    diff
CCONT  jfact   1    0    0        0        0    same
CCONT  fact    0    0    0        0        1    same
CCONT  pellet  1    0    0        0        0    same
CCONT  hermit  1    0    0        0        0    same

As we can see, BCO produced at least one unclear case. According to MV, FaCT++ and HermiT's inferred hierarchy would be in the majority. However, Pellet produced unclear consistent no cases, which meant that we needed further analysis to determine which entailment was correct. Expert review showed that Pellet in fact failed to entail something correctly, as detailed in Section 4.1. In the case of the CAO ontology (total disagreement), we could identify clear bug cases for all reasoners except HermiT. An MV based technique would choose the victor randomly, which disadvantages HermiT. Reasoning over each of the other three ontologies produced a majority for which we could provide further confidence: in each case, every dissenting reasoner was found to have produced a bug case and hence missed an entailment in its overall class hierarchy.

4.1 Analysis of Errors and Method for Patching

Of the 5336 cases generated by our method, 3718 were classed as consistent yes, 1590 as bug and 28 as consistent no. A further diagnostic investigation was carried out on the 1618 error cases.

This diagnosis found four different types of problem for the reasoners. These are (1) removal of asserted axioms from the ontology by FaCT++ and JFact, (2) a difficulty for FaCT++ and JFact to resolve type information, (3) Pellet being unable to determine a particular subsumption relation and (4) a case of Pellet "missing" a set of strongly structurally similar justifications. For each case we used a different method to provide a diagnosis of the problem, or a minimal patch to the ontology that the developer could then use to test against. This was a prudent measure, given the large number of cases. These techniques are naturally dependent on the nature of the errors produced. Ideally, we would like to provide developers with a tuple of ontology, entailment, justification and patch that illustrates the faulty behaviour.

Axiom swallowing: FaCT++, JFact. In the analysis of those cases classed as bug, it was discovered that 116 of the justifications were of cardinality 1. Flagging entailments that are contained in the ontology showed that these subsumption axioms were directly asserted in the ontologies. All such cases were instances of FaCT++ or JFact verifying a justification for an entailment missing from its inferred class hierarchy. Specifically, 100 of these cases were produced by FaCT++ and 16 by JFact; all JFact cases were "replicated" by FaCT++. These 116 cases correspond to 28 particular axioms being missed by FaCT++ and JFact, with JFact missing 7 of them. This suggests a pre-processing bug on the part of FaCT++ and JFact. Of the bug cases, 1440 (approximately 90%) rely upon these 28 axioms, which suggests that the vast majority of our bug cases are down to this error. Further support for this hypothesis comes from analysis of the overall cases: 400 out of the total 5336 cases had justifications of size 1, and 4640 of the cases overall require these 400 to hold. This suggests that most of the work was generated by FaCT++ and JFact failing to entail these results.
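The flagging step above reduces to a simple check per bug case: does the justification consist of exactly one axiom, is that axiom asserted in O, and is the entailment nevertheless missing from the reasoner's class hierarchy? A minimal sketch, assuming the justification is available as a set of OWL API axioms (names ours):

```java
// Sketch: flag "axiom swallowing" candidates.
import org.semanticweb.owlapi.model.OWLAxiom;
import org.semanticweb.owlapi.model.OWLOntology;
import java.util.Set;

class SwallowingCheck {
    /** Flags cardinality-1 justifications whose single axiom is directly
        asserted in O, yet whose entailment is missing from E(O, R). */
    static boolean isAxiomSwallowingCandidate(OWLOntology o,
            Set<OWLAxiom> justification, boolean etaInHierarchy) {
        return justification.size() == 1
                && o.containsAxiom(justification.iterator().next())
                && !etaInHierarchy;
    }
}
```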
Data Type errors: FaCT++, JFact. From the consistent no cases, we found two justifications produced by FaCT++ and JFact asserting that a particular class was unsatisfiable. Direct examination showed that the class Decrease was classified as unsatisfiable because of the specification of a datatype: Decrease had, for its data property polarity, a specified datatype of rdf:PlainLiteral, which was distinct from the specified range of polarity, namely xs:string. The actual value was admissible, but the types differed. According to the W3C specification on plain literals, such a substitution is allowed: RDF plain literals should be interpreted as xs:strings [10]. Consequently, "negative" cast as a plain literal would be accepted as a string. This suggested that JFact and FaCT++ have problems with particular datatypes. We verified this with a pair of minimal test cases.
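Such a minimal test case can be built programmatically with the OWL API. The sketch below (IRIs and names invented for illustration, not taken from JERM) types the polarity value as rdf:PlainLiteral against an xs:string range; since a plain literal without a language tag denotes the same value as the corresponding xs:string, a correct reasoner should report Decrease as satisfiable.

```java
// Sketch of a minimal PlainLiteral/xs:string test case.
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import org.semanticweb.owlapi.reasoner.OWLReasonerFactory;
import org.semanticweb.owlapi.vocab.OWL2Datatype;

class PlainLiteralTest {
    static boolean decreaseSatisfiable(OWLReasonerFactory rf) throws Exception {
        OWLOntologyManager m = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = m.getOWLDataFactory();
        OWLClass decrease = df.getOWLClass(IRI.create("urn:test#Decrease"));
        OWLDataProperty polarity = df.getOWLDataProperty(IRI.create("urn:test#polarity"));

        OWLOntology o = m.createOntology();
        // range(polarity) = xs:string
        m.addAxiom(o, df.getOWLDataPropertyRangeAxiom(
                polarity, df.getOWLDatatype(OWL2Datatype.XSD_STRING.getIRI())));
        // Decrease subClassOf: polarity value "negative"^^rdf:PlainLiteral
        OWLLiteral neg = df.getOWLLiteral("negative", OWL2Datatype.RDF_PLAIN_LITERAL);
        m.addAxiom(o, df.getOWLSubClassOfAxiom(
                decrease, df.getOWLDataHasValue(polarity, neg)));

        OWLReasoner r = rf.createReasoner(o);
        boolean sat = r.isSatisfiable(decrease);  // expected: true
        r.dispose();
        return sat;
    }
}
```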
Wrongly Missing Entailment: Pellet. The 24 additional cases classed as consistent no were all generated by Pellet asserting that a particular entailment did not follow from a given justification. These cases could be collapsed into 8 unique justifications (between-reasoner redundancy). Interestingly, each of these justifications took a similar structural form, and all 8 were verified as correct for their entailment by expert review. Upon examination, we discovered that the class names of each entailment also occur in the axioms of the justifications for all other entailments. Because of this interaction, we found that asserting one of these missing entailments removed the disagreements and forced Pellet to produce the same class hierarchy as the other reasoners. In this case, inserting IAO_109 ⊑ BFO_40 into the ontology allowed Pellet to infer the correct class hierarchy and eliminated all problematic cases from the ontology. This was as precisely as we were able to point to the problematic area. Note that this produces an O′ = O ∪ {α} on which all reasoners agree, but where O′ is not necessarily equivalent to O.

Classification bugs: Pellet. All bug cases generated by Pellet were found to have justifications of size 5 and entailments of the form A ⊑ CAO_323 (the right-hand side is a named class in CAO). These similarities made us suspect that the justifications might be structurally similar and possibly share axioms. Computing the intersection of all the justifications produced three shared axioms: Inverse(CAO_52, CAO_59), CAO_55 ⊑ ∃CAO_59.CAO_323 and CAO_323 ≡ CAO_48 ⊓ ∀CAO_52.CAO_171. We suspected that one of these axioms caused Pellet the difficulty. We singled out the problematic axiom with the following procedure: for a given set of justifications J1, ..., Jn, take S = ⋂ Ji. For each α ∈ S and for each reasoner Rj, infer the class hierarchy from O′ := O \ {α}; if no disagreements on the class hierarchy occur, then stop. We call the removal of α a patch for the problem case. The problematic axiom was singled out as CAO_55 ⊑ ∃CAO_59.CAO_323. This provides a minimal axiom that a developer can use to understand the problematic reasoner behaviour. Note that, as above, it might be that O′ ≢ O.
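A sketch of this patch search in OWL API terms, reusing the hypothetical classHierarchy and disagreements helpers from the Section 2 sketch (all names ours):

```java
// Sketch: remove each axiom shared by all justifications in turn and
// re-check whether the reasoners still disagree on O' = O \ {alpha}.
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import org.semanticweb.owlapi.reasoner.OWLReasonerFactory;
import java.util.*;

class PatchFinder {
    static OWLAxiom findPatch(OWLOntologyManager m, OWLOntology o,
            List<Set<OWLAxiom>> justifications,
            List<OWLReasonerFactory> factories) throws Exception {
        // S: the intersection of the given justifications
        Set<OWLAxiom> shared = new HashSet<>(justifications.get(0));
        for (Set<OWLAxiom> j : justifications) shared.retainAll(j);

        for (OWLAxiom alpha : shared) {
            Set<OWLAxiom> axioms = new HashSet<>(o.getAxioms());
            axioms.remove(alpha);
            OWLOntology oPrime = m.createOntology(axioms);  // O' := O \ {alpha}
            List<Set<OWLSubClassOfAxiom>> hierarchies = new ArrayList<>();
            for (OWLReasonerFactory rf : factories) {
                OWLReasoner r = rf.createReasoner(oPrime);
                hierarchies.add(DisagreementFinder.classHierarchy(
                        oPrime, r, m.getOWLDataFactory()));
                r.dispose();
            }
            boolean agree = DisagreementFinder.disagreements(hierarchies).isEmpty();
            m.removeOntology(oPrime);
            if (agree) {
                return alpha;  // removing alpha patches the problem case
            }
        }
        return null;  // no single shared axiom resolves the disagreement
    }
}
```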
5 Discussion

One important consequence of our justification based method is its implication for majority voting. In two of the cases, we found reasons to suspect that the majority (or the lack of one) might lead to wrong reasoners being classified as "correct". Moreover, we could produce information to justify the choices made by our method, including those cases where our method agreed with MV. The evidence shows a clear difference between the method stipulated here and MV.

A general shortcoming of this system is that it does not catch all errors generated by the reasoners, in particular those where the reasoners are all in agreement (which is the majority of the time). A more artificial problem is our restriction to a single justification per reasoner-entailment pair. This problem is avoidable with greater computational resources.

To our knowledge, there is little similar work. The JustBench benchmarking methodology [2] verifies the justifications being used as benchmarks by cross-checking them against all reasoners (this also emerges naturally when benchmarking). Similar work for automated reasoners has been performed with respect to queries [8]: the authors create test units (ABox representations of the query) to form a test base. For certain benchmark ontologies that are used to assess reasoners, this provides a metric that evaluates the completeness of the reasoners. Importantly, they also stress the need for such methods to be invariant or independent of the ontology being tested against.

6 Conclusions and Future Work

In this paper, we have presented a justification based method to reliably identify bugs in the classification algorithms of OWL reasoners. Our method allows us to narrow down possible sources of bugs, providing a starting point for reasoner debugging. We have evaluated the method in a medium-sized setting and showed that it functions as desired. We are currently evaluating the feasibility of the method on a larger scale.

In the future, we would like to provide the method as a web service. Developers would be able to test their reasoners against a set of standard reasoners and obtain cases that pinpoint potential bugs, consisting of justifications, missing entailments and ontology patches (that make the problem disappear).

References

1. Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.): The Description Logic Handbook, 2nd edn. Cambridge University Press (2007)
2. Bail, S., Parsia, B., Sattler, U.: JustBench: A Framework for OWL Benchmarking. In: The Semantic Web - ISWC 2010, pp. 32-47. Springer (2010)
3. Glimm, B., Horrocks, I., Motik, B., Stoilos, G., Wang, Z.: HermiT: An OWL 2 Reasoner. J. Autom. Reasoning 53, 245-269 (2014). http://dx.doi.org/10.1007/s10817-014-9305-1
4. Gonçalves, R.S., Bail, S., Jimenez-Ruiz, E., Matentzoglu, N., Parsia, B., Glimm, B., Kazakov, Y.: OWL Reasoner Evaluation (ORE) Workshop 2013 Results: Short Report. CEUR Workshop Proceedings, ORE (2013)
5. Horridge, M.: Justification Based Explanation in Ontologies. PhD thesis, University of Manchester (2011)
6. Horridge, M., Bechhofer, S.: The OWL API: A Java API for OWL ontologies. Semantic Web 2(1), 11-21 (2011). http://dx.doi.org/10.3233/SW-2011-0025
7. Sirin, E., Parsia, B., Grau, B.C., Kalyanpur, A., Katz, Y.: Pellet: A practical OWL-DL reasoner. J. Web Sem. 5(2), 51-53 (2007). http://dx.doi.org/10.1016/j.websem.2007.03.004
8. Stoilos, G., Grau, B.C., Horrocks, I.: How Incomplete is Your Semantic Web Reasoner? Systematic Analysis of the Completeness of Query Answering Systems. In: AAAI (2010)
9. Tsarkov, D., Horrocks, I.: FaCT++ Description Logic Reasoner: System Description. In: IJCAR, pp. 292-297 (2006). http://dx.doi.org/10.1007/11814771_26
10. W3C Recommendation: RDF Semantics. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/#DtypeRules