<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Cognitive Complexity of OWL Justifications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matthew Horridge</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samantha Bail</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bijan Parsia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ulrike Sattler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>The University of Manchester</institution>
          <addr-line>Oxford Road, Manchester, M13 9PL</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present an approach to determining the cognitive complexity of justifications for entailments of OWL ontologies. We introduce a simple cognitive complexity model and present the results of validating that model via experiments involving OWL users. The validation is based on test data derived from a large and diverse corpus of naturally occurring justifications. Our contributions include validation for the cognitive complexity model, new insights into justification complexity, a significant corpus with novel analyses of justifications suitable for experimentation, and an experimental protocol suitable for model validation and refinement.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>model does fairly well with some notable exceptions. A follow-up study with an eye
tracker and think aloud protocol supports our explanations for the anomalous behaviour
and suggests both a refinement to the model and a limitation of our experimental
protocol.</p>
<p>While there have been several user studies in the areas of debugging [6,4] and ontology engineering anti-patterns [9], as well as an exploratory investigation into features that make justifications difficult to understand [1], to the best of our knowledge there have been no formal user studies that investigate the cognitive complexity of justifications.</p>
    </sec>
    <sec id="sec-2">
      <title>Cognitive Complexity &amp; Justifications</title>
<p>In psychology, there is a long-standing rivalry between two accounts of human deductive processes: (1) that people apply inferential rules [8], and (2) that people construct mental models [2].2 In spite of a voluminous literature (including functional MRI studies), to date there is no scientific consensus [7], even for propositional reasoning.</p>
      <p>Even if this debate were settled, it would not be clear how to apply it to ontology
engineering. The reasoning problems that are considered in the literature are quite
different from understanding how an entailment follows from a justification in an expressive
logic. Furthermore, the artificiality of our problems may engage different mechanisms
than more “natural” reasoning problems: e.g. even if mental models theory were
correct, people can produce natural deduction proofs and might find that they outperform
“reasoning natively”. For ontology engineering, we do not need a true account of
human deduction, but just need a way to determine how usable justifications are for our
tasks. What is required is a theory of the weak cognitive complexity of justifications, not
one of strong cognitive complexity [10].</p>
<p>A similar practical task is generating sufficiently difficult so-called “Analytical Reasoning Questions” (ARQs) for Graduate Record Examination (GRE) tests. In [7], the investigators constructed and validated a model for the complexity of answering ARQs via experiments with students. Analogously, we aim to validate a model for the complexity of “understanding” justifications via experiments on modellers.</p>
    </sec>
    <sec id="sec-3">
      <title>A Complexity Model</title>
<p>We have developed a cognitive complexity model for justification understanding. This model was derived partly from observations made during an exploratory study in which people attempted to understand justifications from naturally occurring ontologies, and partly from intuitions about what makes justifications difficult to understand. Table 1 describes the model, wherein J is the justification in question for the focal entailment; each component value is multiplied by its weight and then summed with the rest, and the final value is a complexity score for the justification. Broadly speaking, there are two types of components: (1) structural components, such as C1, which require a syntactic analysis of a justification, and (2) semantic components, such as C4, which require entailment checking to reveal non-obvious phenomena. (Footnote 2: (1) can be crudely characterised as people using a natural deduction proof system, and (2) as people using a semantic tableau.)</p>
<p>Components C1 and C2 count the number of different axiom types and class expression types, as defined in the OWL 2 Structural Specification.4 The more diverse the basic logical vocabulary is, the less likely that simple pattern matching will work, and the more “sorts of things” the user must track.</p>
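      <p>The counting in C1 and C2 can be sketched as follows. This is our illustrative Python, not the authors’ implementation; the tuple-based axiom encoding is a hypothetical stand-in for a real OWL syntax tree.</p>
      <preformat>
```python
# Components C1/C2 (sketch): count distinct axiom types and distinct
# class expression constructor types in a justification.
# Axioms are encoded as nested tuples, e.g. ("subclassof", "A", ("some", "R", "B")).

def constructor_types(expr, acc):
    """Recursively collect constructor names from a nested tuple expression."""
    if isinstance(expr, tuple):
        acc.add(expr[0])                      # e.g. 'some', 'only', 'and'
        for part in expr[1:]:
            constructor_types(part, acc)
    return acc

def c1_c2(justification):
    axiom_types = set(ax[0] for ax in justification)          # C1
    expr_types = set()
    for ax in justification:
        for operand in ax[1:]:
            constructor_types(operand, expr_types)            # C2
    return len(axiom_types), len(expr_types)

# {A ⊑ ∃R.B, B ⊑ C} encoded as tuples:
just = [("subclassof", "A", ("some", "R", "B")),
        ("subclassof", "B", "C")]
# c1_c2(just) → (1, 1): one axiom type, one constructor type
```
      </preformat>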
<p>Component C3 detects the presence of universal restrictions where trivial satisfaction can be used to infer subsumption. People are often surprised to learn that if ⟨x, y⟩ ∉ R^I for all y ∈ Δ^I, then x ∈ (∀R.C)^I. This was observed repeatedly in the exploratory study.</p>
<p>Components C4 and C5 detect the presence of synonyms of ⊤ and ⊥ in the signature of a justification where these synonyms are not explicitly introduced via subsumption or equivalence axioms. In the exploratory study, participants failed to spot synonyms of ⊤ in particular.</p>
<p>Component C6 detects the presence of a domain axiom that is not paired with an (entailed) existential restriction along the property whose domain is restricted. This typically goes against people’s expectations of how domain axioms work, and usually indicates some kind of non-obvious reasoning by cases. For example, given the two axioms ∃R.⊤ ⊑ C and ∀R.D ⊑ C, the domain axiom is used to make a statement about objects that have R-successors, while the second axiom makes a statement about those objects that do not have any R-successors, to imply that C is equivalent to ⊤. This is different from the typical pattern of usage, for example where A ⊑ ∃R.C and ∃R.⊤ ⊑ B entails A ⊑ B.</p>
<p>Component C7 measures the maximum modal depth of sub-concepts in J, which tend to generate multiple distinct but interacting propositional contexts.</p>
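      <p>Modal depth, as used by C7, can be computed by a simple recursion over the concept tree; the following is a sketch under the same hypothetical tuple encoding, not the authors’ code.</p>
      <preformat>
```python
# Component C7 (sketch): maximum modal depth of a concept, i.e. the
# deepest nesting of quantified restrictions. Boolean constructors do
# not open a new modal context; quantifiers do.

QUANTIFIERS = {"some", "only", "min", "max", "exactly"}

def modal_depth(expr):
    if not isinstance(expr, tuple):
        return 0                              # a bare class/property name
    child_depth = max((modal_depth(p) for p in expr[1:]), default=0)
    if expr[0] in QUANTIFIERS:
        return child_depth + 1                # opens a new propositional context
    return child_depth

# ∃R.(∀S.C) has modal depth 2; C ⊓ D has modal depth 0
```
      </preformat>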
<p>Component C8 examines the signature difference from entailment to justification. This can indicate confusing redundancy in the entailment, or synonyms of ⊤ in the justification, that may not be obvious. Both cases are surprising to people looking at such justifications. (Footnote 4: http://www.w3.org/TR/owl2-syntax/)</p>
      <p>Components C9 and C10 determine if there is a difference between the type of, and
types of class expressions in, the axiom representing the entailment of interest and the
types of axioms and class expressions that appear in the justification. Any difference can
indicate an extra reasoning step to be performed by a person looking at the justification.</p>
      <p>Component C11 examines the number of subclass axioms that have a complex left
hand side in a laconic version of the justification. Complex class expressions on the left
hand side of subclass axioms in a laconic justification indicate that the conclusions of
several intermediate reasoning steps may interact.</p>
<p>Component C12 examines the number of obvious syntactic subsumption paths through a justification. In the exploratory study, participants found it very easy to quickly read chains of subsumption axioms, for example, {A ⊑ B, B ⊑ C, C ⊑ D, D ⊑ E} to entail A ⊑ E. This complexity component essentially increases the complexity when these kinds of paths are lacking.</p>
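      <p>The path detection behind C12 amounts to reachability over atomic subsumption axioms. A minimal sketch (ours, not the authors’ implementation):</p>
      <preformat>
```python
# Component C12 (sketch): is there an obvious syntactic subsumption path
# from the entailment's subclass to its superclass, following only
# atomic SubClassOf axioms?

def has_syntactic_path(axioms, start, goal):
    edges = {}
    for lhs, rhs in axioms:                   # each axiom: lhs ⊑ rhs
        edges.setdefault(lhs, set()).add(rhs)
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        if node == goal:
            return True
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

# {A ⊑ B, B ⊑ C, C ⊑ D, D ⊑ E} yields a path from A to E
chain = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
# has_syntactic_path(chain, "A", "E") → True
```
      </preformat>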
<p>The weights were determined by rough-and-ready empirical twiddling, without a strong theoretical or specific experimental backing. They correspond to our sense, especially from the exploratory study, of sufficient reasons for difficulty.</p>
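      <p>The overall score is then a weighted sum of component values. The weights and component values below are invented for illustration and are not the figures from Table 1.</p>
      <preformat>
```python
# Complexity score (sketch): score(J) = sum over i of weight_i * C_i(J).

def complexity_score(component_values, weights):
    return sum(weights[name] * value
               for name, value in component_values.items())

weights = {"C1": 100, "C2": 10, "C7": 50}     # hypothetical weights
values = {"C1": 2, "C2": 5, "C7": 3}          # component values for some J
# complexity_score(values, weights) → 400  (2*100 + 5*10 + 3*50)
```
      </preformat>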
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>While the model is plausible and seems to behave reasonably well in applications, its
validation is a challenging topic. In principle, the model is reasonable if it successfully
predicts the difficulty an arbitrary OWL modeller has with an arbitrary justification
sufficiently often. Unfortunately, the space of ontology developers and of OWL
justifications (even of existing, naturally occurring ones) is large and heterogeneous enough
to be difficult to randomly sample.</p>
      <sec id="sec-4-1">
        <title>Design Challenges</title>
        <p>To cope with the heterogeneity of users, any experimental protocol should require
minimal experimental interaction, i.e. it should be executable over the internet from subjects’
own machines with simple installation. Such a protocol trades access to subjects, over
time, for the richness of data gathered. To this end, we adapted one of the experimental
protocols described in [7] and tested it on a more homogeneous set of participants—a
group of MSc students who had completed a lecture course on OWL.</p>
        <p>While the general experimental protocol in [7] seems reasonable, there are some
issues in adapting it to our case. In particular, in ARQs there is a restricted space of
possible (non-)entailments suitable for multiple choice questions. That is, the wrong answers
can straightforwardly be made plausible enough to avoid guessing. (The questions are,
in essence, enumeration problems.) A justification inherently has one statement that it
is a justification for (even though it will be a minimal entailing subset for others). Thus,
there isn’t a standard “multiple set” of probable answers to draw on. In the exam case,
the primary task is successfully answering the question, and the relation between that
success and predictions about the test taker is outside the remit of the experiment (but
there is an established account, both theoretically and empirically). In the justification
case the standard primary task is “understanding” the relationship between the
justification and the entailment. Without observation, it is impossible to distinguish between a
participant who really “gets” it and one who merely acquiesces. In the exploratory study
we performed to help develop the model, we had the participant rank the difficulty of
the justification, but also used think aloud and follow-up questioning to verify the
success in understanding by the participant. This is obviously not a minimal intervention,
and requires a large amount of time and resources on the part of the investigators.</p>
<p>To counter this, the task was shifted from a justification understanding task to something more measurable and similar to the question answering task in [7]. In particular, instead of presenting the justification/entailment pair as a justification/entailment pair and asking the participant to try to “understand” it, we present it as a pair of a set of axioms and a candidate entailment, and ask the participant to determine whether the candidate is, in fact, entailed. This diverges from the standard justification situation, wherein the modeller knows that the axioms entail the candidate (and form a justification), but provides a metric that can be correlated with cognitive complexity: error proportions.</p>
      </sec>
      <sec id="sec-4-2">
<title>Justification Corpus</title>
<p>To cope with the heterogeneity of justifications, we derived a large sample of justifications from ontologies from several well known ontology repositories: the Stanford BioPortal repository5 (30 ontologies plus imports closure), the Dumontier Lab ontology collection6 (15 ontologies plus imports closure), the OBO XP collection7 (17 ontologies plus imports closure) and the TONES repository8 (36 ontologies plus imports closure). To be selected, an ontology had to (1) entail one subsumption between class names with at least one justification that (a) was not the entailment itself, and (b) contains axioms in that ontology (as opposed to the imports closure of the ontology), (2) be downloadable and loadable by the OWL API, and (3) be processable by FaCT++.</p>
        <p>While the selected ontologies cannot be said to generate a truly representative
sample of justifications from the full space of possible justifications (even of those on the
Web), they are diverse enough to put stress on many parts of the model. Moreover, most
of these ontologies are actively developed and used and hence provide justifications that
a significant class of users encounter.</p>
        <p>For each ontology, the class hierarchy was computed, from which direct
subsumptions between class names were extracted. For each direct subsumption, as many
justifications as possible in the space of 10 minutes were computed (typically all
justifications; time-outs were rare). This resulted in a pool of over 64,800 justifications.</p>
<p>While large, the actual logical diversity of this pool is considerably smaller. This is because many justifications, for different entailments, were of exactly the same “shape”. For example, consider J1 = {A ⊑ B, B ⊑ C} ⊨ A ⊑ C and J2 = {F ⊑ E, E ⊑ G} ⊨ F ⊑ G. As can be seen, there is an injective renaming from J1 to J2, and J1 is therefore isomorphic with J2. If a person can understand J1 then, with allowances for variations in name length, they should be able to understand J2. The initial large pool was therefore reduced to a smaller pool of 11,600 non-isomorphic justifications.</p>
        <p>5 http://bioportal.bioontology.org
6 http://dumontierlab.com/?page=ontologies
7 http://www.berkeleybop.org/ontologies/
8 http://owl.cs.manchester.ac.uk/repository/</p>
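        <p>The isomorphism reduction can be approximated by canonicalising names in order of first occurrence and comparing the results; a sketch under our hypothetical tuple encoding (a complete check would also consider reorderings of axioms, which this version ignores):</p>
        <preformat>
```python
# Isomorphism reduction (sketch): two justifications are isomorphic if an
# injective renaming maps one onto the other. Canonicalise non-logical
# names in order of first occurrence; equal canonical forms imply
# isomorphism for a fixed axiom order.

def canonical(justification):
    mapping = {}
    def rename(expr):
        if isinstance(expr, tuple):
            return (expr[0],) + tuple(rename(p) for p in expr[1:])
        if expr not in mapping:
            mapping[expr] = "x%d" % len(mapping)
        return mapping[expr]
    return tuple(rename(ax) for ax in justification)

j1 = [("subclassof", "A", "B"), ("subclassof", "B", "C")]
j2 = [("subclassof", "F", "E"), ("subclassof", "E", "G")]
# canonical(j1) == canonical(j2): both become
# (('subclassof', 'x0', 'x1'), ('subclassof', 'x1', 'x2'))
```
        </preformat>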
      </sec>
      <sec id="sec-4-3">
        <title>Items and Item Selection</title>
        <p>Each experiment consists of a series of test items (questions from a participant point of
view). A test item consists of a set of axioms, one following axiom, and a question, “Do
these axioms entail the following axiom?”. A participant response is one of five possible
answers: “Yes” (it is entailed), “Yes, but not sure”, “Not Sure”, “No, but not sure”, “No”
(it is not entailed). From a participant point of view, any item may or may not contain a
justification. However, in our experiments, every item was, in fact, a justification.</p>
<p>It is obviously possible to have non-justification entailing sets or non-entailing sets of axioms in an item. We chose against such items since (1) we wanted to maximize the number of actual justifications examined, (2) justification understanding is the actual task at hand, and (3) it is unclear how to interpret error rates for non-entailments in light of the model. For some subjects, especially those with little or no prior exposure to justifications, it was unclear whether they understood the difference between the set merely being entailing, and it being minimal and entailing. We did observe one person who made use of this metalogical reasoning in the follow-up study.</p>
<p>Item Construction: For each experiment detailed below, test items were constructed from the pool of 11,600 non-isomorphic justifications. First, in order to reduce variance due primarily to size, justifications whose size was less than 4 axioms or greater than 10 axioms were discarded. This left 3199 (28%) justifications in the pool. In particular, this excluded large justifications that might require a lot of reading time, cause fatigue problems, or intimidate participants, and excluded very small justifications that tended to be trivial.9</p>
        <p>For each justification in the pool of the remaining 3199 non-isomorphic
justifications, the complexity of the justification was computed according to the model
presented in Table 1, and then the justification was assigned to a complexity bin. A total
of 11 bins were constructed over the range of complexity (from 0 to 2200), each with a
complexity interval of 200. We discarded all bins which had 0 non-isomorphic
justifications of size 4-10. This left 8 bins partitioning a complexity range of 200-1800.</p>
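        <p>The size filter and binning step can be sketched as follows; the records and scores below are invented for illustration.</p>
        <preformat>
```python
# Binning (sketch): keep justifications of size 4-10 and assign each to a
# complexity bin of width 200, keyed by the bin's lower bound.

BIN_WIDTH = 200

def assign_bins(scored_justifications):
    """Each record is (justification id, size in axioms, complexity score)."""
    bins = {}
    for jid, size, score in scored_justifications:
        if size in range(4, 11):              # sizes 4-10 only
            lower = int(score // BIN_WIDTH) * BIN_WIDTH
            bins.setdefault(lower, []).append(jid)
    return bins

sample = [("j1", 5, 300), ("j2", 12, 500), ("j3", 7, 1406)]
# assign_bins(sample) → {200: ['j1'], 1400: ['j3']}   (j2 is too large)
```
        </preformat>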
<p>Figure 1 illustrates a key issue. The bulk of the justifications (especially without the trivial ones), both with and without isomorphic reduction, are in the middle complexity range. However, the model is not sophisticated enough that small differences (e.g. below a difference of 400-600) are plausibly meaningful. It is unclear whether the noise from variance in participant abilities would wash out the noise from the complexity model. In other words, just from reflection on the model, justifications whose complexity difference is 400 or less do not seem reliably distinguishable by error rates. Furthermore, non-isomorphism does not eliminate all non-significant logical variance. Consider a chain of two atomic subsumptions vs. a chain of three: they have the same basic logical structure, but are not isomorphic. Thus, we cannot yet say whether this apparent concentration is meaningful. (Footnote 9: Note that, as a result, nearly 40% of all justifications (essentially, the 0-200 bin) have no representative in the pruned set (see Figure 2). Inspection revealed that most of these were trivial single-axiom justifications, e.g. of the form {A ≡ B} ⊨ A ⊑ B or {A ≡ (B ⊓ C)} ⊨ A ⊑ B.)</p>
<p>Since we did not expect to be able to present more than 6 items and keep to our time limits, we chose to focus on an “easy/hard” divide of the lowest three non-empty bins (200-800) and the highest three non-empty bins (1200-1800). While this limits the claims we can make about model performance over the entire corpus, it at least strengthens negative results. If error rates overall do not distinguish the two poles (where we expect the largest effect), then either the model fails or error rates are not a reliable marker. Additionally, if there is an effect, we expect it to be largest in this scenario, making it easier to achieve adequate statistical power.</p>
<p>Each experiment involved a fixed set of test items, which were selected by randomly drawing items from a preselected spread of bins, as described below. Please note that the selection procedure changed in the light of the pilot study, but only to make the selection more challenging for the model.</p>
<p>The final stage of item construction was justification obfuscation. All non-logical terms were replaced with generated symbols. Thus, there was no possibility of using domain knowledge to understand these justifications. The names were all uniform, syntactically distinguishable (e.g. class names from property names) and quite short. The entailment was the same for all items (i.e. C1 ⊑ C2). It is possible that dealing with these purely symbolic justifications distorted participant response from response in the field, even beyond blocking domain knowledge. For example, they could be alienating and thus increase error rates, or they could engage less error-prone pattern recognition.</p>
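        <p>The obfuscation step amounts to an injective renaming of the non-logical vocabulary. The sketch below is ours; the paper only fixes the naming style (short, uniform, with classes and properties distinguishable), so the exact scheme is a guess.</p>
        <preformat>
```python
# Obfuscation (sketch): replace every non-logical term with a generated
# symbol so domain knowledge cannot help, keeping class names and
# property names syntactically distinguishable.

def obfuscate(justification, class_names, property_names):
    fresh = {}
    for i, name in enumerate(sorted(class_names)):
        fresh[name] = "C%d" % (i + 1)
    for i, name in enumerate(sorted(property_names)):
        fresh[name] = "prop%d" % (i + 1)
    def rename(expr):
        if isinstance(expr, tuple):
            return (expr[0],) + tuple(rename(p) for p in expr[1:])
        return fresh.get(expr, expr)
    return [rename(ax) for ax in justification]

just = [("subclassof", "Heart", ("some", "partOf", "CirculatorySystem"))]
# obfuscate(just, {"Heart", "CirculatorySystem"}, {"partOf"})
#   → [('subclassof', 'C2', ('some', 'prop1', 'C1'))]
```
        </preformat>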
<p>Fig. 1. Justification Corpus Complexity Distribution</p>
      </sec>
      <sec id="sec-4-4">
        <title>Pilot study</title>
<p>Participants: Seven members of Computer Science (CS) academic or research staff, or of a PhD programme, with over 2 years of experience with ontologies and justifications. Materials and procedures: The study was performed using an in-house web-based survey tool, which tracks times between all clicks on the page and thus records the time taken to make each decision.</p>
        <p>The participants were given a series of test items consisting of 3 practice items,
followed by 1 common easy item (E1 of complexity 300) and four additional items,
2 ranked easy (E2 and E3 of complexities 544 and 690, resp.) and 2 ranked hard (H1
and H2 of complexities 1220 and 1406), which were randomly (but distinctly) ordered
for each participant. The easy items were drawn from bins 200-800, and the hard items
from bins 1200-1800. The expected time to complete the study was a maximum of 30
minutes, including the orientation, practice items, and brief demographic questionnaire
(taken after all items were completed).</p>
<p>Results: Errors and times are given in Table 2(a). Since all of the items were in fact justifications, participant responses were recoded to success or failure as follows: Success = (“Yes” | “Yes, but not sure”) and Failure = (“Not sure” | “No, but not sure” | “No”). Error proportions were analysed using Cochran’s Q Test, which takes into consideration the pairing of successes and failures for a given participant. Times were analysed using two-tailed paired-sample t-tests.</p>
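        <p>For reference, Cochran’s Q statistic on the paired success/failure table can be computed directly; this stdlib-only sketch omits the p-value, which is obtained by referring Q to a chi-square distribution with k - 1 degrees of freedom (e.g. via scipy.stats.chi2.sf).</p>
        <preformat>
```python
# Cochran's Q (sketch): rows are participants, columns are items, entries
# are 1 (success) or 0 (failure).
# Q = (k-1) * (k * sum(Cj^2) - N^2) / (k * N - sum(Ri^2)),
# with Cj the column totals, Ri the row totals, N the grand total.

def cochran_q(table):
    k = len(table[0])                         # number of items
    col = [sum(row[j] for row in table) for j in range(k)]
    row_tot = [sum(row) for row in table]
    n = sum(col)
    numerator = (k - 1) * (k * sum(c * c for c in col) - n * n)
    denominator = k * n - sum(r * r for r in row_tot)
    return numerator / denominator

# five participants, two items: successes stay paired within each row
table = [(1, 0), (1, 0), (1, 1), (0, 0), (1, 0)]
# cochran_q(table) → 3.0
```
        </preformat>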
<p>An initial Cochran’s Q Test across all items revealed a strongly significant difference in error proportions between the items [Q(4) = 16.00, p = 0.003]. Further analysis using Cochran’s Q Test on pairs of items revealed strong statistically significant differences in error proportion between: E1/H1 [Q(1) = 6.00, p = 0.014], E1/H2 [Q(1) = 6.00, p = 0.014], E2/H2 [Q(1) = 5.00, p = 0.025] and E3/H2 [Q(1) = 5.00, p = 0.025]. The differences in the remaining pairs, while not significant at the p = 0.05 level, were quite close to significance, i.e. E2/H1 [Q(1) = 3.57, p = 0.059] and E3/H1 [Q(1) = 2.70, p = 0.10]. In summary, these error rate results were encouraging.</p>
<p>An analysis of times using paired-sample t-tests revealed that time spent understanding a particular item is not a good predictor of complexity. While there were significant differences in the times for E1/H1 [p = 0.00016], E2/H1 [p = 0.025], and E3/H1 [p = 0.023], there were no significant differences in the times for E1/H2 [p = 0.15], E2/H2 [p = 0.34] and E3/H2 [p = 0.11]. This result was anticipated, as in the exploratory study people gave up very quickly on justifications that they felt they could not understand.</p>
      </sec>
      <sec id="sec-4-5a">
        <title>Experiment 1</title>
        <p>Participants: 14 volunteers from a CS MSc class on OWL ontology modelling, who were given chocolate for their participation. Each participant had minimal exposure to OWL (or logic) before the class but had, in the course of the prior 5 weeks, constructed or manipulated several ontologies and received an overview of the basics of OWL 2, reasoning, etc. They did not receive any specific training on justifications.
Materials and procedures: The study was performed according to the protocol used in the pilot study. A new set of items was used. Since the mean time taken by pilot study participants to complete the survey was 13.65 minutes, with a standard deviation of 4.87 minutes, an additional hard justification was added to the test items. Furthermore, all of the items ranked easy were drawn from the highest easy complexity bin (bin 600-800). In the pilot study, we observed that the lower-ranking easy items were found to be quite easy and, by inspection of their bins, we found that it was quite likely to draw similar justifications. The third bin (600-800) is much larger and more logically diverse, and is thus more challenging for the model.</p>
<p>The series consisted of 3 practice items followed by 6 additional items: 3 easy items (EM1, EM2 and EM3, of complexities 654, 703, and 675) and 3 hard items (HM1, HM2 and HM3, of complexities 1380, 1395, and 1406). The items were randomly ordered for each participant. Again, the expected time to complete the study was a maximum of 30 minutes, including orientation, practice items and a brief demographic questionnaire.</p>
<p>Results: Errors and times are presented in Table 2(b). The coding to error is the same as in the pilot. An analysis with Cochran’s Q Test across all items reveals a significant difference in error proportion [Q(5) = 15.095, p = 0.0045].</p>
<p>A pairwise analysis between easy and hard items reveals that there are significant, and highly significant, differences in errors between EM1/HM1 [Q(1) = 4.50, p = 0.034], EM1/HM2 [Q(1) = 7.00, p = 0.008], EM2/HM1 [Q(1) = 4.50, p = 0.034], EM2/HM2 [Q(1) = 5.44, p = 0.02], and EM3/HM2 [Q(1) = 5.44, p = 0.02].</p>
<p>However, there were no significant differences between EM1/HM3 [Q(1) = 0.00, p = 1.00], EM2/HM3 [Q(1) = 0.00, p = 1.00], EM3/HM3 [Q(1) = 2.00, p = 0.16] and EM3/HM1 [Q(1) = 0.67, p = 0.41].</p>
<p>With regard to the nonsignificant differences between certain easy and hard items, two items stand out: the easy item EM3 and the hard item HM3, which are shown in Figure 2.</p>
<p>In line with the results from the pilot study, an analysis of times using a paired-samples t-test revealed significant differences between some easy and hard items, with the easy times being significantly less than the hard times: EM1/HM1 [p = 0.023], EM2/HM2 [p = 0.016] and EM3/HM1 [p = 0.025]. However, for other pairs of easy and hard items, times were not significantly different: EM1/HM1 [p = 0.43], EM2/HM1 [p = 0.11] and EM3/HM2 [p = 0.10]. Again, time is not a reliable predictor of model complexity.</p>
        <p>Anomalies in Experiment 1: Two items (EM3 and HM3) did not exhibit their
predicted error rate relations. For item EM3, we conjectured that a certain pattern of
superfluous axiom parts in the item (not recognisable by the model) made it harder than
the model predicted. That is, that the model was wrong.</p>
<p>For item HM3 we conjectured that the model correctly identifies this item as hard,10 but that the MSc students answered “Yes” because of a misleading pattern of axioms at the start and end of item HM3. The high “success” rate was due to an error in reasoning, that is, a failure in understanding.</p>
<p>In order to determine whether our conjectures were possible and reasonable, we conducted a follow-up study with the goal of observing the conjectured behaviours in situ. Note that this study does not explain what happened in Experiment 1.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Experiment 2</title>
        <p>Participants: Two CS Research Associates and one CS PhD student, none of whom
had taken part in the pilot study. All participants were very experienced with OWL.
Materials and procedures: Items and protocol were exactly the same as Experiment
1, with the addition of the think aloud protocol. Furthermore, the screen, participant
vocalization, and eye tracking were recorded.</p>
<p>Results: With regard to EM3, think aloud revealed that all participants were distracted by the superfluous axiom parts in item EM3. Figure 2 shows an eye tracker heat map for the most extreme case of distraction in item EM3. As can be seen, hot spots lie over the superfluous parts of axioms. Think aloud revealed that all participants initially tried to see how the ∃prop1.C6 conjunct in the third axiom contributed to the entailment and struggled when they realised that this was not the case.</p>
<p>In the case of HM3, think aloud revealed that none of the participants understood how the entailment followed from the set of axioms. However, two of them responded correctly and stated that the entailment did hold. As conjectured, the patterns formed by the start and end axioms in the item set seemed to mislead them. In particular, when disregarding quantifiers, the start axiom C1 ⊑ ∀prop1.C3 and the end axiom C2 ⊑ ∃prop1.C3 ⊔ … look very similar. One participant spotted this similarity and claimed that the entailment held as a result. Hot spots occur over the final axiom and the first axiom in the eye tracker heat map (Figure 2), with relatively little activity in the axioms in the middle of the justification. (Footnote 10: It had been observed to stymie experienced modellers in the field. Furthermore, it involves deriving a synonym for ⊤, which was not a move this cohort had experience with.)</p>
        <p>In this paper we presented a methodology for validating the predicted cognitive
complexity of justifications. The main advantages of the experimental protocol used in the
methodology is that minimal study facilitator intervention is required. This means that,
over time, it should be possible to collect rich and varied data fairly cheaply and from
geographically distributed participants. In addition to this, given a justification corpus
and population of interest, the main experiment is easily repeatable with minimal
resources and setup. Care must be taken in interpreting the results: in particular, the
protocol is weak on “too hard” justifications, as it cannot distinguish a mislabelling by the
model from participants failing for the wrong reason.</p>
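The quantifier-blind similarity observed for HM3 above can be illustrated with a small sketch. This is not the authors' complexity model; it is a crude token-overlap measure over hypothetical string renderings of the two axioms, showing how dropping quantifiers makes the start and end axioms look far more alike.

```python
# Illustrative sketch (not the paper's model): measure how similar two
# axiom strings look once quantifier symbols are disregarded, mimicking
# the surface pattern-matching that misled participants in item HM3.

QUANTIFIERS = {"∀", "∃"}

def strip_quantifiers(axiom: str) -> set[str]:
    """Tokenise an axiom string and drop quantifier symbols."""
    return {t for t in axiom.split() if t not in QUANTIFIERS}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two token sets (1.0 = identical)."""
    return len(a & b) / len(a | b)

# Hypothetical renderings of the start and end axioms of HM3.
start_axiom = "C1 ⊑ ∀ prop1 C3"
end_axiom   = "C2 ⊑ ∃ prop1 C3"

with_q    = jaccard(set(start_axiom.split()), set(end_axiom.split()))
without_q = jaccard(strip_quantifiers(start_axiom),
                    strip_quantifiers(end_axiom))

print(f"similarity with quantifiers:    {with_q:.2f}")
print(f"similarity without quantifiers: {without_q:.2f}")
```

Disregarding the quantifiers raises the overlap between the two axioms, consistent with the participant who judged the entailment to hold on surface similarity alone.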
        <p>The cognitive complexity model presented in this paper fared reasonably
well. In most cases, there was a significant difference in error proportion between
model-ranked easy and hard justifications. In the cases where error proportions revealed no
difference better than chance, a further small-scale follow-up study, in the form of a more
expensive think-aloud study, was used to gain insight into the problems. These
inspections highlighted an area for model improvement, namely superfluity. It
is unclear how to rectify this in the model, as there could be justifications with
superfluous parts that are trivial to understand, but the location and shape of the superfluity
seem to be important factors.</p>
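The paper does not specify here which statistical test was applied to the error proportions. As an illustrative sketch only, with hypothetical counts, a standard two-proportion z-test is one common way to check whether the error rate on model-ranked hard items differs from that on easy items better than chance:

```python
from math import sqrt, erf

def two_proportion_z(err_a: int, n_a: int, err_b: int, n_b: int):
    """Two-proportion z-test: compare the error rate on model-ranked
    'easy' items (a) against 'hard' items (b)."""
    p_a, p_b = err_a / n_a, err_b / n_b
    p_pool = (err_a + err_b) / (n_a + n_b)          # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # One-sided p-value from the standard normal CDF.
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return z, p_value

# Hypothetical counts (not from the paper): 4/40 errors on easy items
# vs 16/40 on hard items.
z, p = two_proportion_z(4, 40, 16, 40)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

With these made-up counts the difference is significant at conventional levels; with a genuinely inconclusive item set, such a test motivates exactly the kind of follow-up think-aloud inspection described above.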
        <p>The refinement and validation of our model is an ongoing task and will require
considerably more experimental cycles. We plan to conduct a series of experiments
with different cohorts as well as with an expanded corpus. We also plan to continue the
analysis of our corpus with an eye to performing experiments that validate the model over
the whole corpus (for some given population).</p>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>