<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Cognitive Complexity of OWL Justifications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matthew Horridge</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samantha Bail</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bijan Parsia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ulrike Sattler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>The University of Manchester</institution>
          <addr-line>Oxford Road, Manchester, M13 9PL</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present an approach to determining the cognitive complexity of justifications for entailments of OWL ontologies. We introduce a simple cognitive complexity model and present the results of validating that model via experiments involving OWL users. The validation is based on test data derived from a large and diverse corpus of naturally occurring justifications. Our contributions include validation for the cognitive complexity model, new insights into justification complexity, a significant corpus with novel analyses of justifications suitable for experimentation, and an experimental protocol suitable for model validation and refinement.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>model does fairly well with some notable exceptions. A follow-up study with an eye
tracker and think aloud protocol supports our explanations for the anomalous behaviour
and suggests both a refinement to the model and a limitation of our experimental
protocol.</p>
<p>While there have been several user studies in the areas of debugging [6,4] and ontology engineering anti-patterns [9], as well as an exploratory investigation into features that make justifications difficult to understand [1], to the best of our knowledge there have been no formal user studies that investigate the cognitive complexity of justifications.</p>
    </sec>
    <sec id="sec-2">
      <title>Cognitive Complexity &amp; Justifications</title>
<p>In psychology, there is a long-standing rivalry between two accounts of human deductive processes: (1) that people apply inferential rules [8], and (2) that people construct mental models [2].2 In spite of a voluminous literature (including functional MRI studies), to date there is no scientific consensus [7], even for propositional reasoning.</p>
      <p>Even if this debate were settled, it would not be clear how to apply it to ontology
engineering. The reasoning problems that are considered in the literature are quite
different from understanding how an entailment follows from a justification in an expressive
logic. Furthermore, the artificiality of our problems may engage different mechanisms
than more “natural” reasoning problems: e.g. even if mental models theory were
correct, people can produce natural deduction proofs and might find that they outperform
“reasoning natively”. For ontology engineering, we do not need a true account of
human deduction, but just need a way to determine how usable justifications are for our
tasks. What is required is a theory of the weak cognitive complexity of justifications, not
one of strong cognitive complexity [10].</p>
<p>A similar practical task is generating sufficiently difficult so-called “Analytical Reasoning Questions” (ARQs) for Graduate Record Examination (GRE) tests. In [7], the investigators constructed and validated a model for the complexity of answering ARQs via experiments with students. Analogously, we aim to validate a model for the complexity of “understanding” justifications via experiments on modellers.</p>
    </sec>
    <sec id="sec-3">
      <title>A Complexity Model</title>
<p>We have developed a cognitive complexity model for justification understanding. This model was derived partly from observations made during an exploratory study in which people attempted to understand justifications from naturally occurring ontologies, and partly from intuitions about what makes justifications difficult to understand. Table 1 describes the model, wherein J is the justification in question for the focal entailment; each component value is multiplied by its weight and then summed with the rest, and the final value is a complexity score for the justification. Broadly speaking, there are two types of components: (1) structural components, such as C1, which require a syntactic analysis of a justification, and (2) semantic components, such as C4, which require entailment checking to reveal non-obvious phenomena. (Footnote 2: (1) can be crudely characterised as people using a natural deduction proof system, and (2) as people using a semantic tableau.)</p>
<p>Components C1 and C2 count the number of different axiom types and class expression types, as defined in the OWL 2 Structural Specification.4 The more diverse the basic logical vocabulary is, the less likely that simple pattern matching will work, and the more “sorts of things” the user must track.</p>
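      <p>The counting in C1 and C2 can be sketched as follows. This is our illustrative Python, not the authors’ implementation; the tuple-based axiom encoding is a hypothetical stand-in for a real OWL syntax tree.</p>
      <preformat>
```python
# Components C1/C2 (sketch): count distinct axiom types and distinct
# class expression constructor types in a justification.
# Axioms are encoded as nested tuples, e.g. ("subclassof", "A", ("some", "R", "B")).

def constructor_types(expr, acc):
    """Recursively collect constructor names from a nested tuple expression."""
    if isinstance(expr, tuple):
        acc.add(expr[0])                      # e.g. 'some', 'only', 'and'
        for part in expr[1:]:
            constructor_types(part, acc)
    return acc

def c1_c2(justification):
    axiom_types = set(ax[0] for ax in justification)          # C1
    expr_types = set()
    for ax in justification:
        for operand in ax[1:]:
            constructor_types(operand, expr_types)            # C2
    return len(axiom_types), len(expr_types)

# {A ⊑ ∃R.B, B ⊑ C} encoded as tuples:
just = [("subclassof", "A", ("some", "R", "B")),
        ("subclassof", "B", "C")]
# c1_c2(just) → (1, 1): one axiom type, one constructor type
```
      </preformat>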
<p>Component C3 detects the presence of universal restrictions where trivial satisfaction can be used to infer subsumption. People are often surprised to learn that if ⟨x, y⟩ ∉ R^I for all y ∈ Δ^I, then x ∈ (∀R.C)^I. This was observed repeatedly in the exploratory study.</p>
<p>Components C4 and C5 detect the presence of synonyms of ⊤ and ⊥ in the signature of a justification where these synonyms are not explicitly introduced via subsumption or equivalence axioms. In the exploratory study, participants failed to spot synonyms of ⊤ in particular.</p>
<p>Component C6 detects the presence of a domain axiom that is not paired with an (entailed) existential restriction along the property whose domain is restricted. This typically goes against people’s expectations of how domain axioms work, and usually indicates some kind of non-obvious reasoning by cases. For example, given the two axioms ∃R.⊤ ⊑ C and ∀R.D ⊑ C, the domain axiom is used to make a statement about objects that have R-successors, while the second axiom makes a statement about those objects that do not have any R-successors, to imply that C is equivalent to ⊤. This is different from the typical pattern of usage, for example where A ⊑ ∃R.C and ∃R.⊤ ⊑ B entails A ⊑ B.</p>
<p>Component C7 measures the maximum modal depth of sub-concepts in J, which tend to generate multiple distinct but interacting propositional contexts.</p>
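      <p>Modal depth, as used by C7, can be computed by a simple recursion over the concept tree; the following is a sketch under the same hypothetical tuple encoding, not the authors’ code.</p>
      <preformat>
```python
# Component C7 (sketch): maximum modal depth of a concept, i.e. the
# deepest nesting of quantified restrictions. Boolean constructors do
# not open a new modal context; quantifiers do.

QUANTIFIERS = {"some", "only", "min", "max", "exactly"}

def modal_depth(expr):
    if not isinstance(expr, tuple):
        return 0                              # a bare class/property name
    child_depth = max((modal_depth(p) for p in expr[1:]), default=0)
    if expr[0] in QUANTIFIERS:
        return child_depth + 1                # opens a new propositional context
    return child_depth

# ∃R.(∀S.C) has modal depth 2; C ⊓ D has modal depth 0
```
      </preformat>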
<p>Component C8 examines the signature difference from entailment to justification. This can indicate confusing redundancy in the entailment, or synonyms of ⊤ in the justification, that may not be obvious. Both cases are surprising to people looking at such justifications. (Footnote 4: http://www.w3.org/TR/owl2-syntax/)</p>
      <p>Components C9 and C10 determine if there is a difference between the type of, and
types of class expressions in, the axiom representing the entailment of interest and the
types of axioms and class expressions that appear in the justification. Any difference can
indicate an extra reasoning step to be performed by a person looking at the justification.</p>
      <p>Component C11 examines the number of subclass axioms that have a complex left
hand side in a laconic version of the justification. Complex class expressions on the left
hand side of subclass axioms in a laconic justification indicate that the conclusions of
several intermediate reasoning steps may interact.</p>
<p>Component C12 examines the number of obvious syntactic subsumption paths through a justification. In the exploratory study, participants found it very easy to quickly read chains of subsumption axioms, for example, {A ⊑ B, B ⊑ C, C ⊑ D, D ⊑ E} to entail A ⊑ E. This complexity component essentially increases the complexity when these kinds of paths are lacking.</p>
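      <p>The path detection behind C12 amounts to reachability over atomic subsumption axioms. A minimal sketch (ours, not the authors’ implementation):</p>
      <preformat>
```python
# Component C12 (sketch): is there an obvious syntactic subsumption path
# from the entailment's subclass to its superclass, following only
# atomic SubClassOf axioms?

def has_syntactic_path(axioms, start, goal):
    edges = {}
    for lhs, rhs in axioms:                   # each axiom: lhs ⊑ rhs
        edges.setdefault(lhs, set()).add(rhs)
    seen, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        if node == goal:
            return True
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return False

# {A ⊑ B, B ⊑ C, C ⊑ D, D ⊑ E} yields a path from A to E
chain = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
# has_syntactic_path(chain, "A", "E") → True
```
      </preformat>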
<p>The weights were determined by rough-and-ready empirical twiddling, without a strong theoretical or specific experimental backing. They correspond to our sense, especially from the exploratory study, of sufficient reasons for difficulty.</p>
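      <p>The overall score is then a weighted sum of component values. The weights and component values below are invented for illustration and are not the figures from Table 1.</p>
      <preformat>
```python
# Complexity score (sketch): score(J) = sum over i of weight_i * C_i(J).

def complexity_score(component_values, weights):
    return sum(weights[name] * value
               for name, value in component_values.items())

weights = {"C1": 100, "C2": 10, "C7": 50}     # hypothetical weights
values = {"C1": 2, "C2": 5, "C7": 3}          # component values for some J
# complexity_score(values, weights) → 400  (2*100 + 5*10 + 3*50)
```
      </preformat>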
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>While the model is plausible and seems to behave reasonably well in applications, its
validation is a challenging topic. In principle, the model is reasonable if it successfully
predicts the difficulty an arbitrary OWL modeller has with an arbitrary justification
sufficiently often. Unfortunately, the space of ontology developers and of OWL
justifications (even of existing, naturally occurring ones) is large and heterogeneous enough
to be difficult to randomly sample.</p>
      <sec id="sec-4-1">
        <title>Design Challenges</title>
        <p>To cope with the heterogeneity of users, any experimental protocol should require
minimal experimental interaction, i.e. it should be executable over the internet from subjects’
own machines with simple installation. Such a protocol trades access to subjects, over
time, for the richness of data gathered. To this end, we adapted one of the experimental
protocols described in [7] and tested it on a more homogeneous set of participants—a
group of MSc students who had completed a lecture course on OWL.</p>
        <p>While the general experimental protocol in [7] seems reasonable, there are some
issues in adapting it to our case. In particular, in ARQs there is a restricted space of
possible (non-)entailments suitable for multiple choice questions. That is, the wrong answers
can straightforwardly be made plausible enough to avoid guessing. (The questions are,
in essence, enumeration problems.) A justification inherently has one statement that it
is a justification for (even though it will be a minimal entailing subset for others). Thus,
there isn’t a standard “multiple set” of probable answers to draw on. In the exam case,
the primary task is successfully answering the question, and the relation between that
success and predictions about the test taker is outside the remit of the experiment (but
there is an established account, both theoretically and empirically). In the justification
case the standard primary task is “understanding” the relationship between the
justification and the entailment. Without observation, it is impossible to distinguish between a
participant who really “gets” it and one who merely acquiesces. In the exploratory study
we performed to help develop the model, we had the participant rank the difficulty of
the justification, but also used think aloud and follow-up questioning to verify the
success in understanding by the participant. This is obviously not a minimal intervention,
and requires a large amount of time and resources on the part of the investigators.</p>
<p>To counter this, the task was shifted from a justification understanding task to something more measurable and similar to the question answering task in [7]. In particular, instead of presenting the justification/entailment pair as a justification/entailment pair and asking the participant to try to “understand” it, we present it as a pair of a set of axioms and a candidate entailment, and ask the participant to determine whether the candidate is, in fact, entailed. This diverges from the standard justification situation, wherein the modeller knows that the axioms entail the candidate (and form a justification), but provides a metric that can be correlated with cognitive complexity: error proportions.</p>
      </sec>
      <sec id="sec-4-2">
<title>Justification Corpus</title>
<p>To cope with the heterogeneity of justifications, we derived a large sample of justifications from ontologies from several well known ontology repositories: the Stanford BioPortal repository5 (30 ontologies plus imports closure), the Dumontier Lab ontology collection6 (15 ontologies plus imports closure), the OBO XP collection7 (17 ontologies plus imports closure) and the TONES repository8 (36 ontologies plus imports closure). To be selected, an ontology had to (1) entail one subsumption between class names with at least one justification that (a) was not the entailment itself, and (b) contains axioms in that ontology (as opposed to the imports closure of the ontology), (2) be downloadable and loadable by the OWL API, and (3) be processable by FaCT++.</p>
        <p>While the selected ontologies cannot be said to generate a truly representative
sample of justifications from the full space of possible justifications (even of those on the
Web), they are diverse enough to put stress on many parts of the model. Moreover, most
of these ontologies are actively developed and used and hence provide justifications that
a significant class of users encounter.</p>
        <p>For each ontology, the class hierarchy was computed, from which direct
subsumptions between class names were extracted. For each direct subsumption, as many
justifications as possible in the space of 10 minutes were computed (typically all
justifications; time-outs were rare). This resulted in a pool of over 64,800 justifications.</p>
<p>While large, the actual logical diversity of this pool is considerably smaller. This is because many justifications, for different entailments, were of exactly the same “shape”. For example, consider J1 = {A ⊑ B, B ⊑ C} ⊨ A ⊑ C and J2 = {F ⊑ E, E ⊑ G} ⊨ F ⊑ G. As can be seen, there is an injective renaming from J1 to J2, and J1 is therefore isomorphic with J2. If a person can understand J1 then, with allowances for variations in name length, they should be able to understand J2. The initial large pool was therefore reduced to a smaller pool of 11,600 non-isomorphic justifications.</p>
        <p>5 http://bioportal.bioontology.org
6 http://dumontierlab.com/?page=ontologies
7 http://www.berkeleybop.org/ontologies/
8 http://owl.cs.manchester.ac.uk/repository/</p>
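        <p>The isomorphism reduction can be approximated by canonicalising names in order of first occurrence and comparing the results; a sketch under our hypothetical tuple encoding (a complete check would also consider reorderings of axioms, which this version ignores):</p>
        <preformat>
```python
# Isomorphism reduction (sketch): two justifications are isomorphic if an
# injective renaming maps one onto the other. Canonicalise non-logical
# names in order of first occurrence; equal canonical forms imply
# isomorphism for a fixed axiom order.

def canonical(justification):
    mapping = {}
    def rename(expr):
        if isinstance(expr, tuple):
            return (expr[0],) + tuple(rename(p) for p in expr[1:])
        if expr not in mapping:
            mapping[expr] = "x%d" % len(mapping)
        return mapping[expr]
    return tuple(rename(ax) for ax in justification)

j1 = [("subclassof", "A", "B"), ("subclassof", "B", "C")]
j2 = [("subclassof", "F", "E"), ("subclassof", "E", "G")]
# canonical(j1) == canonical(j2): both become
# (('subclassof', 'x0', 'x1'), ('subclassof', 'x1', 'x2'))
```
        </preformat>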
      </sec>
      <sec id="sec-4-3">
        <title>Items and Item Selection</title>
        <p>Each experiment consists of a series of test items (questions from a participant point of
view). A test item consists of a set of axioms, one following axiom, and a question, “Do
these axioms entail the following axiom?”. A participant response is one of five possible
answers: “Yes” (it is entailed), “Yes, but not sure”, “Not Sure”, “No, but not sure”, “No”
(it is not entailed). From a participant point of view, any item may or may not contain a
justification. However, in our experiments, every item was, in fact, a justification.</p>
<p>It is obviously possible to have non-justification entailing sets or non-entailing sets of axioms in an item. We chose against such items since (1) we wanted to maximize the number of actual justifications examined, (2) justification understanding is the actual task at hand, and (3) it is unclear how to interpret error rates for non-entailments in light of the model. For some subjects, especially those with little or no prior exposure to justifications, it was unclear whether they understood the difference between the set merely being entailing, and it being minimal and entailing. We did observe one person who made use of this metalogical reasoning in the follow-up study.</p>
<p>Item Construction: For each experiment detailed below, test items were constructed from the pool of 11,600 non-isomorphic justifications. First, in order to reduce variance due primarily to size, justifications whose size was less than 4 axioms or greater than 10 axioms were discarded. This left 3199 (28%) justifications in the pool. In particular, this excluded large justifications that might require a lot of reading time, cause fatigue problems, or intimidate participants, and excluded very small justifications that tended to be trivial.9</p>
        <p>For each justification in the pool of the remaining 3199 non-isomorphic
justifications, the complexity of the justification was computed according to the model
presented in Table 1, and then the justification was assigned to a complexity bin. A total
of 11 bins were constructed over the range of complexity (from 0 to 2200), each with a
complexity interval of 200. We discarded all bins which had 0 non-isomorphic
justifications of size 4-10. This left 8 bins partitioning a complexity range of 200-1800.</p>
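        <p>The size filter and binning step can be sketched as follows; the records and scores below are invented for illustration.</p>
        <preformat>
```python
# Binning (sketch): keep justifications of size 4-10 and assign each to a
# complexity bin of width 200, keyed by the bin's lower bound.

BIN_WIDTH = 200

def assign_bins(scored_justifications):
    """Each record is (justification id, size in axioms, complexity score)."""
    bins = {}
    for jid, size, score in scored_justifications:
        if size in range(4, 11):              # sizes 4-10 only
            lower = int(score // BIN_WIDTH) * BIN_WIDTH
            bins.setdefault(lower, []).append(jid)
    return bins

sample = [("j1", 5, 300), ("j2", 12, 500), ("j3", 7, 1406)]
# assign_bins(sample) → {200: ['j1'], 1400: ['j3']}   (j2 is too large)
```
        </preformat>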
<p>Figure 1 illustrates a key issue. The bulk of the justifications (especially without the trivial ones), both with and without isomorphic reduction, are in the middle complexity range. However, the model is not sophisticated enough that small differences (e.g. below a difference of 400-600) are plausibly meaningful. It is unclear whether the noise from variance in participant abilities would wash out the noise from the complexity model. In other words, just from reflection on the model, justifications whose complexity difference is 400 or less do not seem reliably distinguishable by error rates. Furthermore, non-isomorphism does not eliminate all non-significant logical variance. Consider a chain of two atomic subsumptions vs. a chain of three: they have the same basic logical structure, but are not isomorphic. Thus, we cannot yet say whether this apparent concentration is meaningful. (Footnote 9: Note that, as a result, nearly 40% of all justifications (essentially, the 0-200 bin) have no representative in the pruned set (see Figure 2). Inspection revealed that most of these were trivial single-axiom justifications, e.g. of the form {A ≡ B} ⊨ A ⊑ B or {A ≡ (B ⊓ C)} ⊨ A ⊑ B.)</p>
<p>Since we did not expect to be able to present more than 6 items and keep to our time limits, we chose to focus on an “easy/hard” divide of the lowest three non-empty bins (200-800) and the highest three non-empty bins (1200-1800). While this limits the claims we can make about model performance over the entire corpus, it at least strengthens negative results. If error rates overall do not distinguish the two poles (where we expect the largest effect), then either the model fails or error rates are not a reliable marker. Additionally, if there is an effect, we expect it to be largest in this scenario, making it easier to achieve adequate statistical power.</p>
<p>Each experiment involved a fixed set of test items, which were selected by randomly drawing items from a preselected spread of bins, as described below. Please note that the selection procedure changed in the light of the pilot study, but only to make the selection more challenging for the model.</p>
<p>The final stage of item construction was justification obfuscation. All non-logical terms were replaced with generated symbols. Thus, there was no possibility of using domain knowledge to understand these justifications. The names were all uniform, syntactically distinguishable (e.g. class names from property names) and quite short. The entailment was the same for all items (i.e. C1 ⊑ C2). It is possible that dealing with these purely symbolic justifications distorted participant response from response in the field, even beyond blocking domain knowledge. For example, they could be alienating and thus increase error rates, or they could engage less error-prone pattern recognition.</p>
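        <p>The obfuscation step amounts to an injective renaming of the non-logical vocabulary. The sketch below is ours; the paper only fixes the naming style (short, uniform, with classes and properties distinguishable), so the exact scheme is a guess.</p>
        <preformat>
```python
# Obfuscation (sketch): replace every non-logical term with a generated
# symbol so domain knowledge cannot help, keeping class names and
# property names syntactically distinguishable.

def obfuscate(justification, class_names, property_names):
    fresh = {}
    for i, name in enumerate(sorted(class_names)):
        fresh[name] = "C%d" % (i + 1)
    for i, name in enumerate(sorted(property_names)):
        fresh[name] = "prop%d" % (i + 1)
    def rename(expr):
        if isinstance(expr, tuple):
            return (expr[0],) + tuple(rename(p) for p in expr[1:])
        return fresh.get(expr, expr)
    return [rename(ax) for ax in justification]

just = [("subclassof", "Heart", ("some", "partOf", "CirculatorySystem"))]
# obfuscate(just, {"Heart", "CirculatorySystem"}, {"partOf"})
#   → [('subclassof', 'C2', ('some', 'prop1', 'C1'))]
```
        </preformat>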
<p>Fig. 1. Justification Corpus Complexity Distribution</p>
      </sec>
      <sec id="sec-4-4">
        <title>Pilot study</title>
<p>Participants: Seven members of Computer Science (CS) academic or research staff, or of a PhD programme, with over 2 years of experience with ontologies and justifications. Materials and procedures: The study was performed using an in-house web-based survey tool, which tracks times between all clicks on the page and thus records the time taken to make each decision.</p>
        <p>The participants were given a series of test items consisting of 3 practice items,
followed by 1 common easy item (E1 of complexity 300) and four additional items,
2 ranked easy (E2 and E3 of complexities 544 and 690, resp.) and 2 ranked hard (H1
and H2 of complexities 1220 and 1406), which were randomly (but distinctly) ordered
for each participant. The easy items were drawn from bins 200-800, and the hard items
from bins 1200-1800. The expected time to complete the study was a maximum of 30
minutes, including the orientation, practice items, and brief demographic questionnaire
(taken after all items were completed).</p>
<p>Results: Errors and times are given in Table 2(a). Since all of the items were in fact justifications, participant responses were recoded to success or failure as follows: Success = (“Yes” | “Yes, but not sure”) and Failure = (“Not sure” | “No, but not sure” | “No”). Error proportions were analysed using Cochran’s Q Test, which takes into consideration the pairing of successes and failures for a given participant. Times were analysed using two-tailed paired-sample t-tests.</p>
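        <p>For reference, Cochran’s Q statistic on the paired success/failure table can be computed directly; this stdlib-only sketch omits the p-value, which is obtained by referring Q to a chi-square distribution with k - 1 degrees of freedom (e.g. via scipy.stats.chi2.sf).</p>
        <preformat>
```python
# Cochran's Q (sketch): rows are participants, columns are items, entries
# are 1 (success) or 0 (failure).
# Q = (k-1) * (k * sum(Cj^2) - N^2) / (k * N - sum(Ri^2)),
# with Cj the column totals, Ri the row totals, N the grand total.

def cochran_q(table):
    k = len(table[0])                         # number of items
    col = [sum(row[j] for row in table) for j in range(k)]
    row_tot = [sum(row) for row in table]
    n = sum(col)
    numerator = (k - 1) * (k * sum(c * c for c in col) - n * n)
    denominator = k * n - sum(r * r for r in row_tot)
    return numerator / denominator

# five participants, two items: successes stay paired within each row
table = [(1, 0), (1, 0), (1, 1), (0, 0), (1, 0)]
# cochran_q(table) → 3.0
```
        </preformat>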
<p>An initial Cochran’s Q Test across all items revealed a strongly significant difference in error proportions between the items [Q(4) = 16.00, p = 0.003]. Further analysis using Cochran’s Q Test on pairs of items revealed strong statistically significant differences in error proportion between: E1/H1 [Q(1) = 6.00, p = 0.014], E1/H2 [Q(1) = 6.00, p = 0.014], E2/H2 [Q(1) = 5.00, p = 0.025] and E3/H2 [Q(1) = 5.00, p = 0.025]. The differences in the remaining pairs, while not significant at the p = 0.05 level, were quite close to significance, i.e. E2/H1 [Q(1) = 3.57, p = 0.059] and E3/H1 [Q(1) = 2.70, p = 0.10]. In summary, these error rate results were encouraging.</p>
<p>An analysis of times using paired-sample t-tests revealed that time spent understanding a particular item is not a good predictor of complexity. While there were significant differences in the times for E1/H1 [p = 0.00016], E2/H1 [p = 0.025], and E3/H1 [p = 0.023], there were no significant differences in the times for E1/H2 [p = 0.15], E2/H2 [p = 0.34] and E3/H2 [p = 0.11]. This result was anticipated, as in the exploratory study people gave up very quickly on justifications that they felt they could not understand.</p>
      </sec>
      <sec id="sec-4-5a">
        <title>Experiment 1</title>
        <p>Participants: 14 volunteers from a CS MSc class on OWL ontology modelling, who were given chocolate for their participation. Each participant had minimal exposure to OWL (or logic) before the class but had, in the course of the prior 5 weeks, constructed or manipulated several ontologies and received an overview of the basics of OWL 2, reasoning, etc. They did not receive any specific training on justifications.
Materials and procedures: The study was performed according to the protocol used in the pilot study. A new set of items was used. Since the mean time taken by pilot study participants to complete the survey was 13.65 minutes, with a standard deviation of 4.87 minutes, an additional hard justification was added to the test items. Furthermore, all of the items ranked easy were drawn from the highest easy complexity bin (bin 600-800). In the pilot study, we observed that the lower-ranking easy items were found to be quite easy and, by inspection of their bins, we found that it was quite likely to draw similar justifications. The third bin (600-800) is much larger and more logically diverse, and is thus more challenging for the model.</p>
<p>The series consisted of 3 practice items followed by 6 additional items: 3 easy items (EM1, EM2 and EM3, of complexities 654, 703, and 675) and 3 hard items (HM1, HM2 and HM3, of complexities 1380, 1395, and 1406). The items were randomly ordered for each participant. Again, the expected time to complete the study was a maximum of 30 minutes, including orientation, practice items and a brief demographic questionnaire.</p>
<p>Results: Errors and times are presented in Table 2(b). The coding to error is the same as in the pilot. An analysis with Cochran’s Q Test across all items reveals a significant difference in error proportion [Q(5) = 15.095, p = 0.0045].</p>
<p>A pairwise analysis between easy and hard items reveals that there are significant, and highly significant, differences in errors between EM1/HM1 [Q(1) = 4.50, p = 0.034], EM1/HM2 [Q(1) = 7.00, p = 0.008], EM2/HM1 [Q(1) = 4.50, p = 0.034], EM2/HM2 [Q(1) = 5.44, p = 0.02], and EM3/HM2 [Q(1) = 5.44, p = 0.02].</p>
<p>However, there were no significant differences between EM1/HM3 [Q(1) = 0.00, p = 1.00], EM2/HM3 [Q(1) = 0.00, p = 1.00], EM3/HM3 [Q(1) = 2.00, p = 0.16] and EM3/HM1 [Q(1) = 0.67, p = 0.41].</p>
<p>With regard to the nonsignificant differences between certain easy and hard items, two items stand out: the easy item EM3 and the hard item HM3, which are shown in Figure 2.</p>
<p>In line with the results from the pilot study, an analysis of times using a paired-samples t-test revealed significant differences between some easy and hard items, with the easy times being significantly less than the hard times: EM1/HM1 [p = 0.023], EM2/HM2 [p = 0.016] and EM3/HM1 [p = 0.025]. However, for other pairs of easy and hard items, times were not significantly different: EM1/HM1 [p = 0.43], EM2/HM1 [p = 0.11] and EM3/HM2 [p = 0.10]. Again, time is not a reliable predictor of model complexity.</p>
        <p>Anomalies in Experiment 1: Two items (EM3 and HM3) did not exhibit their
predicted error rate relations. For item EM3, we conjectured that a certain pattern of
superfluous axiom parts in the item (not recognisable by the model) made it harder than
the model predicted. That is, that the model was wrong.</p>
<p>For item HM3 we conjectured that the model correctly identifies this item as hard,10 but that the MSc students answered “Yes” because of a misleading pattern of axioms at the start and end of item HM3. The high “success” rate was due to an error in reasoning, that is, a failure in understanding.</p>
<p>In order to determine whether our conjectures were possible and reasonable, we conducted a follow-up study with the goal of observing the conjectured behaviours in situ. Note that this study does not explain what happened in Experiment 1.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Experiment 2</title>
        <p>Participants: Two CS Research Associates and one CS PhD student, none of whom
had taken part in the pilot study. All participants were very experienced with OWL.
Materials and procedures: Items and protocol were exactly the same as Experiment
1, with the addition of the think aloud protocol. Furthermore, the screen, participant
vocalization, and eye tracking were recorded.</p>
<p>Results: With regard to EM3, think aloud revealed that all participants were distracted by the superfluous axiom parts in item EM3. Figure 2 shows an eye tracker heat map for the most extreme case of distraction in item EM3. As can be seen, hot spots lie over the superfluous parts of axioms. Think aloud revealed that all participants initially tried to see how the ∃prop1.C6 conjunct in the third axiom contributed to the entailment and struggled when they realised that this was not the case.</p>
<p>In the case of HM3, think aloud revealed that none of the participants understood how the entailment followed from the set of axioms. However, two of them responded correctly and stated that the entailment did hold. As conjectured, the patterns formed by the start and end axioms in the item set seemed to mislead them. In particular, when disregarding quantifiers, the start axiom C1 ⊑ ∀prop1.C3 and the end axiom C2 ⊑ ∃prop1.C3 ⊔ … look very similar. One participant spotted this similarity and claimed that the entailment held as a result. Hot spots occur over the final axiom and the first axiom in the eye tracker heat map (Figure 2), with relatively little activity in the axioms in the middle of the justification. (Footnote 10: It had been observed to stymie experienced modellers in the field. Furthermore, it involves deriving a synonym for ⊤, which was not a move this cohort had experience with.)</p>
        <p>In this paper we presented a methodology for validating the predicted cognitive
complexity of justifications. The main advantages of the experimental protocol used in the
methodology is that minimal study facilitator intervention is required. This means that,
over time, it should be possible to collect rich and varied data fairly cheaply and from
geographically distributed participants. In addition to this, given a justification corpus
and population of interest, the main experiment is easily repeatable with minimal
resources and setup. Care must be taken in interpreting the results: in particular, the
protocol is weak on “too hard” justifications, as it cannot distinguish a mislabelling by the
model from participants failing for the wrong reason.</p>
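The quantifier-blind similarity observed for HM3 above can be illustrated with a small sketch. This is not the authors' complexity model; it is a crude token-overlap measure over hypothetical string renderings of the two axioms, showing how dropping quantifiers makes the start and end axioms look far more alike.

```python
# Illustrative sketch (not the paper's model): measure how similar two
# axiom strings look once quantifier symbols are disregarded, mimicking
# the surface pattern-matching that misled participants in item HM3.

QUANTIFIERS = {"∀", "∃"}

def strip_quantifiers(axiom: str) -> set[str]:
    """Tokenise an axiom string and drop quantifier symbols."""
    return {t for t in axiom.split() if t not in QUANTIFIERS}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two token sets (1.0 = identical)."""
    return len(a & b) / len(a | b)

# Hypothetical renderings of the start and end axioms of HM3.
start_axiom = "C1 ⊑ ∀ prop1 C3"
end_axiom   = "C2 ⊑ ∃ prop1 C3"

with_q    = jaccard(set(start_axiom.split()), set(end_axiom.split()))
without_q = jaccard(strip_quantifiers(start_axiom),
                    strip_quantifiers(end_axiom))

print(f"similarity with quantifiers:    {with_q:.2f}")
print(f"similarity without quantifiers: {without_q:.2f}")
```

Disregarding the quantifiers raises the overlap between the two axioms, consistent with the participant who judged the entailment to hold on surface similarity alone.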
        <p>The cognitive complexity model presented in this paper fared reasonably
well. In most cases, there was a significant difference in error proportion between
model-ranked easy and hard justifications. In the cases where error proportions revealed no
difference better than chance, a further small-scale follow-up study, in the form of a more
expensive think-aloud study, was used to gain insight into the problems. These
inspections highlighted an area for model improvement, namely superfluity. It
is unclear how to rectify this in the model, as there could be justifications with
superfluous parts that are trivial to understand, but the location and shape of the superfluity
seem to be important factors.</p>
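The paper does not specify here which statistical test was applied to the error proportions. As an illustrative sketch only, with hypothetical counts, a standard two-proportion z-test is one common way to check whether the error rate on model-ranked hard items differs from that on easy items better than chance:

```python
from math import sqrt, erf

def two_proportion_z(err_a: int, n_a: int, err_b: int, n_b: int):
    """Two-proportion z-test: compare the error rate on model-ranked
    'easy' items (a) against 'hard' items (b)."""
    p_a, p_b = err_a / n_a, err_b / n_b
    p_pool = (err_a + err_b) / (n_a + n_b)          # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # One-sided p-value from the standard normal CDF.
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
    return z, p_value

# Hypothetical counts (not from the paper): 4/40 errors on easy items
# vs 16/40 on hard items.
z, p = two_proportion_z(4, 40, 16, 40)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

With these made-up counts the difference is significant at conventional levels; with a genuinely inconclusive item set, such a test motivates exactly the kind of follow-up think-aloud inspection described above.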
        <p>The refinement and validation of our model is an ongoing task and will require
considerably more experimental cycles. We plan to conduct a series of experiments
with different cohorts as well as with an expanded corpus. We also plan to continue the
analysis of our corpus with an eye to performing experiments that validate the model over
the whole corpus (for some given population).</p>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>