<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Empirical Robustness of Description Logic Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rafael S. Gonçalves</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Matentzoglu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bijan Parsia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Uli Sattler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science, University of Manchester</institution>
          ,
          <addr-line>Manchester</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In spite of the recent renaissance in lightweight description logics (DLs), many prominent DLs, such as those underlying the Web Ontology Language (OWL), have high worst-case complexity for their key inference services. Modern reasoners have a large array of optimizations, tuned calculi, and implementation tricks that allow them to perform very well in a variety of application scenarios, even though the complexity results ensure that they will perform poorly on some inputs. For users, the key question is how often they will encounter those pathological inputs in practice, that is, how robust reasoners are. We attempt to answer this question for the classification of existing ontologies as they are found on the Web. It is a fairly common user task to examine ontologies published on the Web as part of a development process. Thus, the robustness of reasoners in this scenario is both directly interesting and provides some hints toward answering the broader question. From our experiments, we show that the current crop of OWL reasoners, in collaboration, is very robust against the Web.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Motivation</title>
      <p>
        A serious concern about both versions 1 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and 2 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] of the Web Ontology
Language (OWL) is that the underlying description logics (SHOIQ and SROIQ)
exhibit extremely bad worst case complexity (NEXPTIME and 2NEXPTIME) for their
key inference services. While highly optimized description logic reasoners have,
since the mid-1990s, exhibited rather good performance in real cases, even in those
more constrained cases there are ontologies (such as GALEN) which have proved
impossible to process for over a decade. Indeed, concern with such pathology stimulated a
renaissance of research into tractable description logics, with the EL family [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and the
DL-Lite [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] family being incorporated as special “profiles” of OWL 2. However, even
though the number of ontologies available on the Web has grown enormously since the
standardization of OWL, it is still unclear how robust modern, highly optimized
reasoners are to such input. Anecdotal evidence suggests that pathological cases are common
enough to cause problems, however, systematic evidence has been scarce.
      </p>
      <p>
        In this paper we investigate the question of whether modern, highly-optimized
description logic reasoners are robust over Web input. The general intuition of a robust
system is that it is resistant to failure in the face of a range of input. For any particular
robustness determination, one must decide: 1) the range of input, 2) the functional or
non-functional properties of interest, and 3) what counts as failure. The instantiation
of these parameters strongly influences robustness judgements, with the very same
reasoner being highly robust under one scenario and very non-robust under another. For our
current purposes, the key scenario is that an ontology engineer, using a tool like Protégé
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], is inspecting ontologies published on the Web with an eye to possible reuse, and,
as is common, they wish to classify the ontology using a standard OWL 2 DL reasoner
as part of their evaluation. This scenario yields the following constraints: 1) for input,
we examine Web-based corpora, 2) functional: acceptance (will the reasoner load and
process the ontology); non-functional: performance (i.e., will the reasoner complete
classification before the ontology engineer gives up), 3) w.r.t. acceptance, failure means
either rejecting the input or crashing while processing, and we might reasonably
expect an engineer to wait up to 2 hours if the ontology seems “worth it”. If a reasoner
(or a set of reasoners) is successful for 90% of a corpus, we count that reasoner as
robust over that corpus, with 95% and 99% indicating “strong” and “extreme” robustness.
While these levels are clearly arbitrary (as is the timeout), they provide a framework to
set expectations. Robustness under these assumptions does not ensure robustness under
other assumptions (e.g., over subsets of these ontologies as experienced during
development or over a more stringent time constraint), yet they are challenging enough that
it was unclear to us ex ante whether any reasoners would be robust for any corpus. In
fact, we find that the reasoners are robust or near robust for most of the cases we
examine including for lower timeouts. More significantly, if we take the best result for
each ontology (which represents a kind of “meta-reasoner”, where our test reasoners
are run in parallel), then the set of reasoners is extremely robust over all corpora. Thus,
in a fairly precise, if limited, sense, we demonstrate that SHOIQ and SROIQ are
practical description logics.
      </p>
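The robustness criterion above can be sketched as a small function. The 90%, 95%, and 99% thresholds and the notion of success (loading, processing, and classifying within the timeout) come from the text; the function name and signature are our own illustration, not part of the study's tooling.

```python
# Illustrative sketch of the paper's robustness levels: a reasoner (or set of
# reasoners) is "robust" over a corpus if it succeeds on at least 90% of it,
# with 95% and 99% marking "strong" and "extreme" robustness respectively.
def robustness_level(successes: int, corpus_size: int) -> str:
    rate = successes / corpus_size
    if rate >= 0.99:
        return "extremely robust"
    if rate >= 0.95:
        return "strongly robust"
    if rate >= 0.90:
        return "robust"
    return "not robust"
```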
    </sec>
    <sec id="sec-2">
      <title>Materials &amp; Methods</title>
      <p>For our input data, we gathered three sets of ontologies from the Web — all versions of
the NCI Thesaurus (NCIt), ontologies in the NCBO Bioportal repository, and the results
of a Web crawl, each with fundamentally different characteristics.</p>
      <p>
        The NCIt has been continuously developed and published in monthly versions since
2003. The NCIt archive (http://evs.nci.nih.gov/ftp1/NCI_Thesaurus) contains 106 versions parseable by the OWL API [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] (http://owlapi.sourceforge.net), from
release 02.00 (October 2003) through to release 12.11d (November 2012), ranging in
size from 49,475 to 133,900 logical axioms and in expressivity from ALE to SH(D).
The NCIt team is a fairly stable, closed team of about 20 ontology developers who use a
highly regimented process and, at least since 2006, have incorporated OWL reasoners in
their tool chain (namely FaCT++ and Pellet). The NCIt is large and easily accessible, and
thus has been an informal benchmark for reasoner developers. Additionally, the NCI
has funded various infrastructure projects, including improvements to reasoners. Thus,
we might reasonably expect that reasoners are robust w.r.t. this corpus, both because the
NCI team may be tuning their ontology to the available reasoners (though the fact that
they fund improvements suggests not), and because reasoner developers are tuning for
NCIt.
      </p>
      <p>The NCBO Bioportal is a Web based repository for health care and life science
ontologies. We use a snapshot of (publicly downloadable ontologies from) the BioPortal
repository from November 2012, consisting of 292 OWL and OBO parseable
ontologies. The average number of logical axioms in the corpus is 28,439 (total: 8,190,504
and median: 979 axioms), and 89 of these ontologies contain named individuals. Four
ontologies contained no logical axioms at all and thus were discarded. In expressivity, the
ontologies range from the inexpressive AL DL to the very expressive SROIQ. The
ontologies are developed and used in a wide range of largely unrelated projects for a
variety of purposes using a variety of tools. While Bioportal has received some attention
from the research community, it is not yet a standard target for reasoner developers.</p>
      <p>The third corpus, obtained by a short Web crawl and fuelled by a high number
of seeds from Swoogle, Google and ontology repositories on the Web, was collected
in November 2012. We picked a random sample of 822 ontologies, out of which 145
contained no logical axioms at all and thus were discarded, leaving 677 ontologies for
our experiment. The average number of logical axioms is 2,405 (total: 1,628,207 and
median: 57), and the expressivity ranges from AL to SRIQ. These ontologies span
a wide range of subjects and are completely uncontrolled with respect to their origin.
Perhaps not surprisingly, there are fewer axioms overall and on average, with half of
the ontologies containing under 60 axioms. This may reflect less commitment to the
ontologies than we see in the more curated set. However, there is no reason to think that
the reasoners have been specially tuned to these ontologies and, given the worst case
complexity of the logics, even small ontologies are a potential pathological case. Thus,
it is unclear what the rational robustness expectation is for this set.</p>
      <p>
        We selected four reasoners for testing based on the following criteria: a) coverage
of all of OWL 2, b) freely available for download, c) native support for the OWL API,
and finally d) based on sound, complete and terminating algorithms. As such, the
chosen reasoners are Pellet [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], HermiT [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], FaCT++ [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], and JFact. We excluded, e.g.,
the RacerPro [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], CB [
        <xref ref-type="bibr" rid="ref13">13</xref>
], and KAON2 (http://kaon2.semanticweb.org) reasoners due to their lack of coverage of all
OWL 2 features and of native support for the OWL API. Finally, we did not consider
approximate (either unsound or incomplete) reasoners, such as TrOWL [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], so that
we can compare classification results between reasoners, and because we feel that
approximation is generally only considered in cases where sound and complete reasoners
fail.
      </p>
      <p>
        For all our experiments we use the current 2013 reasoner versions, namely, Pellet
v2.3.0, HermiT v1.3.6, FaCT++ v1.6.1 and JFact v1.0. However, since NCIt
performance has been studied previously [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we decided to compare the current reasoner
versions with the versions used in the 2011 study, namely Pellet v2.2.2, HermiT v1.3.3,
FaCT++ v1.5.3 and JFact v0.2, in order to test how much tuning to NCIt occurs.
      </p>
      <p>As mentioned earlier, we set the classification timeout to 2 hours per
ontology-reasoner pair. From a scenario perspective, 2 hours is rather generous – many
ontologists will give up much sooner than that. However, 2 hours gives us an idea of which
“hard” ontologies are clearly within striking distance, without making completing the
experiments infeasible. In the presentation below, we also examine a tighter timeout (and thus
a harder robustness criterion) of about 100 seconds. The main experiment machine has
an Intel Quad-Core Xeon 3.2GHz processor with 32GB DDR3 RAM. A second
experiment involving solely the NCIt (and both reasoner sets) was performed on a machine
with an Intel Dual-Core i7 2.7GHz processor and 16GB DDR3 RAM. All tests were
run on Mac OS X 10.7.5, using Java v1.7 and the OWL API v3.4.1.</p>
      <p>The test corpora, experiment results, and reasoners used are available from
http://sites.google.com/site/reasonerbenchmark (the full set of crawled ontologies is
available upon request).</p>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>In all experiments we categorise ontology classification times into the following bins:
Very Easy (≤1 second), Easy (1-10 seconds), Medium (10-100 seconds), Hard
(100-1000 seconds), and Very Hard (&gt;1000 seconds). We denote by “Impatient Robustness”
a measure of how many ontologies terminate in an acceptable time for most users, i.e.,
ontologies in the Medium bin or below. Throughout this section we use “Best Combo”
for the best of all 4 reasoners’ results (i.e., the fastest time), and, analogously, “Worst
Combo” for the worst.</p>
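The binning scheme and the derived measures can be sketched as follows. The bin boundaries and the 100-second "impatient" cut-off come from the text; the data structures (per-reasoner dictionaries of classification times) are our own illustrative assumption.

```python
import math

# Sketch of the paper's time bins: Very Easy (<=1s), Easy (1-10s),
# Medium (10-100s), Hard (100-1000s), Very Hard (>1000s).
BINS = [(1, "Very Easy"), (10, "Easy"), (100, "Medium"),
        (1000, "Hard"), (math.inf, "Very Hard")]

def bin_for(seconds: float) -> str:
    for upper, name in BINS:
        if seconds <= upper:
            return name
    return "Very Hard"

def impatient_robustness(times: list[float], corpus_size: int) -> float:
    # Fraction of the corpus classified in the Medium bin or below (<= 100s).
    ok = sum(1 for t in times if t <= 100)
    return ok / corpus_size

def best_combo(per_reasoner: dict[str, dict[str, float]]) -> dict[str, float]:
    # "Best Combo": the fastest successful time per ontology across reasoners,
    # i.e., the result of running all test reasoners in parallel.
    best: dict[str, float] = {}
    for times in per_reasoner.values():
        for onto, t in times.items():
            if onto not in best or t < best[onto]:
                best[onto] = t
    return best
```

A "Worst Combo" would be computed analogously with the slowest time per ontology.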
      <p><bold>NCI Thesaurus.</bold>
In this experiment we test both the 2011 and 2013 reasoner version sets, and compare
the performance behaviour of each reasoner. The classification times for both reasoner
sets are shown in Figure 1. Using the reasoner versions from 2011, and taking into
account those ontologies that all reasoners managed to process and classify, FaCT++ is
on average the fastest of all 4 reasoners, taking 14.7 seconds per version. JFact comes
second, with an average of 22.9 seconds per version, while Pellet is the third fastest,
taking on average 36.5 seconds per version, and finally HermiT is the slowest, with 150
seconds per version (see Table 1).</p>
      <p>When switching to the 2013 reasoner set, the performance winner remains FaCT++,
with an average of 19.2 seconds per ontology. However, in second place now comes
Pellet, taking 61.6 seconds on average, in third HermiT with an average of 174 seconds,
and finally JFact, taking 180 seconds on average per ontology. Notice that from 2011 to
2013 there was a significant improvement in JFact’s robustness, with far fewer errors.
Similarly, Pellet has fewer errors in its most recent version, and its performance is
superior to that of the 2011 version. FaCT++’s and HermiT’s performance decreased slightly from
2011 to 2013 on this corpus, though not nearly as much as JFact’s, possibly because in
its most recent version JFact is able to process the more recent versions of the NCIt.</p>
      <p>Overall, FaCT++ is the most robust reasoner for the NCIt corpus, having no errors
in either 2011 or 2013 versions (see Tables 1 and 2). Furthermore, it is the fastest
performing reasoner across nearly all versions. The least robust is, interestingly, FaCT++’s
port to the Java language: JFact, due to the high number of errors reported. Though there
is improvement from 2011 to 2013, this “young” reasoner is still not as fast as FaCT++.
The reasoner errors encountered throughout the NCIt were, by Pellet: “OutOfMemory”
errors, by HermiT: “StackOverflow”, and finally by JFact: “IllegalArgument”.
</p>
      <p>[Fig. 1: classification times per NCIt version for the 2011 and 2013 versions of Pellet, HermiT, FaCT++, and JFact.]</p>
      <p>In the snapshot of BioPortal, out of all 288 non-empty ontologies, 9 ontologies are
inconsistent and there are 234 that all reasoners manage to classify within the timeout
(see Table 3). Out of those ontologies where all reasoners completed classification,
FaCT++ was on average the fastest (2.9 seconds per ontology), followed by JFact (5.9
seconds), HermiT (9.8 seconds), and finally Pellet (16.7 seconds). However, in terms of
robustness, our results show that Pellet is the most robust of all reasoners, only failing
to handle 9 ontologies, while HermiT, the least robust, fails to classify 27 ontologies.
FaCT++ and JFact fail to handle 24 and 25 ontologies respectively (see Table 4 for more
details regarding errors).</p>
      <p>Generally, Pellet is not only the most robust reasoner for BioPortal, with fewer
errors, but also exhibits fast performance on a high number of ontologies. However,
it does have the most timeouts; but note that some of these were on ontologies that
other reasoners threw an error on. The remaining 3 reasoners are very close to each
other performance- and robustness-wise, with HermiT having fewer timeouts but more errors than
JFact and FaCT++, and slower performance. Thus HermiT is the least robust reasoner
for BioPortal.</p>
      <sec id="sec-3-1">
        <title>BioPortal</title>
        <p>Classification results for the BioPortal corpus (ontologies per bin, plus timeouts, errors, and Impatient Robustness per reasoner):</p>
        <preformat>
               Pellet      HermiT      JFact       FaCT++      Best Combo   Worst Combo
Very Easy      190 (66%)   170 (59%)   184 (64%)   218 (76%)   236 (82%)    152 (53%)
Easy            56 (19%)    61 (21%)    58 (20%)    24 (8%)     28 (10%)     58 (20%)
Medium          10 (3%)     15 (5%)      8 (3%)      7 (2%)     11 (4%)      10 (3%)
Hard             4 (1%)      4 (1%)      2 (1%)      2 (1%)      4 (1%)       2 (1%)
Very Hard        6 (2%)      3 (1%)      0 (0%)      3 (1%)      4 (1%)       2 (1%)
Timeout         13 (5%)      8 (3%)     11 (4%)     10 (3%)      5 (2%)
Errors           9 (3%)     27 (9%)     25 (9%)     24 (8%)
Impatient Rob.  89%         85%         87%         86%
        </preformat>
        <p>Out of the 677 non-empty ontologies from the Web crawl corpus, all reasoners
completed classification of 560 of them. In these 560, Pellet was the fastest reasoner on
average (0.5 seconds per ontology), followed by FaCT++ (1.5 seconds), HermiT (3.1
seconds), and finally JFact (6.2 seconds). In terms of robustness, Pellet is, again, the
most robust, having thrown errors on only 17 ontologies (see Table 5). It is also the
reasoner with the most timeouts, but again, several of these occurred on ontologies on
which other reasoners threw errors. FaCT++ and HermiT both have a high number of
errors, while, curiously, JFact did much better on that front in this corpus. In Table 6
the errors found across the corpus are broken down.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Web Crawl</title>
        <p>Classification results for the Web crawl corpus (ontologies per bin, plus timeouts, errors, and Impatient Robustness per reasoner):</p>
        <preformat>
               Pellet      HermiT      JFact       FaCT++      Best Combo   Worst Combo
Very Easy      597 (88%)   536 (79%)   557 (82%)   566 (84%)   642 (95%)    493 (73%)
Easy            44 (6%)     36 (5%)     45 (7%)     12 (2%)     26 (4%)      44 (6%)
Medium           2 (0%)      8 (1%)     11 (2%)      0 (0%)      3 (0%)      12 (2%)
Hard             1 (0%)      1 (0%)      4 (1%)      5 (1%)      2 (0%)       3 (0%)
Very Hard        0 (0%)      1 (0%)      1 (0%)      1 (0%)      0 (0%)       1 (0%)
Timeout         16 (2%)      6 (1%)      5 (1%)      5 (1%)
Errors          17 (3%)     89 (13%)    54 (8%)     88 (13%)
Impatient Rob.  95%         86%         91%         85%
        </preformat>
        <p>Overall, Pellet is the most robust and fastest (among ontologies that could be
classified by all reasoners) reasoner for this corpus, followed closely by JFact, both in terms
of robustness and performance. The least robust reasoners for the Web crawl corpus are
FaCT++ and HermiT, with 88 and 89 errors, respectively. However, HermiT performed
slightly better on the lower bins, while FaCT++ was clearly the slowest in this corpus.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>Overall we have processed a total of 1,071 ontologies, the largest such reasoner
benchmark, having found that amongst the 4 tested reasoners Pellet is the most robust of all
(see Table 7). Surprisingly, Pellet is followed by JFact in our robustness test, due to JFact
having far fewer errors than FaCT++. HermiT and FaCT++ have the same overall
robustness, but FaCT++ has fewer errors and higher impatient robustness.</p>
      <p>While Pellet is the most robust reasoner, we urge some caution in that reading. In
particular, this does not mean that Pellet will always do best or even perform reasonably.
In fact, it may timeout where other reasoners finish reasonably fast. The set of reasoners
(taken together and considering the best results) is extremely robust across the board
(for each reasoner’s contribution to the best case reasoner, see Figure 2). Thus, we
have strong empirical evidence that the ontologies on the Web do not supply many in
principle intractable cases, but only cases which are difficult for particular reasoners.
</p>
      <p>Fig. 2. Number of times that each reasoner equals the best case, for each corpus.</p>
      <p>Note that FaCT++ and JFact fail to process several ontologies due to poor support
for OWL datatypes, particularly datatypes not specified in the OWL 2 datatype map;
both of these reasoners, as well as HermiT, have little support for OWL 1 datatypes.
If we discount the non-OWL 2 datatype errors, then FaCT++ would be the
most robust w.r.t. OWL 2, followed by HermiT and Pellet. From Figure 2 we see that
FaCT++ outperforms other reasoners on many occasions, but, due to the high number
of errors thrown, its robustness w.r.t. our input data is not nearly at the same level as its
performance.</p>
      <p>The 9 ontologies which no reasoner classified within the timeout range in
expressivity between ALEHIF+ and SRIQ. Their average number of logical axioms is
56,179; the minimum is 341 axioms (a SRIQ ontology), the maximum 379,734 axioms
(a SR ontology), and the median 17,385 axioms (a SHIF ontology).</p>
      <p>It is clear that deriving a sensible ranking even simply using average or total time
is not straightforward. Our results have rather strong implications for reasoner
experiments, especially those purporting to show the advantages of an optimisation or a
technique or an implementation: The space is very complex and it is very easy to
simultaneously generate a biased sample for one system and against another. Even simple,
seemingly innocuous things like timeouts and classification failures require tremendous
care in handling. If results are going to be meaningful across papers we need to converge
on experimental inputs, methods, and reporting forms.</p>
      <p>Finally, in order to get an overall picture of how these robustness measurements
relate to the OWL profiles into which ontologies fall, we divide our ontologies into
their corresponding OWL profiles and match them with the observed performance bin
of the Best and Worst Combo reasoners. This is displayed in Figure 3.</p>
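The counting rule used for this profile breakdown can be sketched as follows. The overlap behaviour (an ontology may be counted in several of EL, QL, and RL, while the DL bin holds only ontologies outside all three lightweight profiles) is stated in the text; the per-ontology profile sets are assumed inputs, and the function itself is our own illustration.

```python
# Illustrative tally of ontologies per OWL 2 profile bin. Profile totals may
# exceed the corpus size because EL, QL and RL overlap; the DL bin is
# exclusive, counting only ontologies in none of the lightweight profiles.
def profile_counts(profiles: dict[str, set[str]]) -> dict[str, int]:
    counts = {"EL": 0, "QL": 0, "RL": 0, "DL": 0}
    for onto, in_profiles in profiles.items():
        light = in_profiles & {"EL", "QL", "RL"}
        if light:
            for p in light:
                counts[p] += 1
        else:
            counts["DL"] += 1  # not in any lightweight profile: DL only
    return counts
```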
      <p>Since there is an overlap between the EL, RL and QL profiles of OWL 2, some
ontologies are counted in more than one such bin, meaning that the total number of
ontologies in Figure 3 does not add up to the number of ontologies in our corpus.
However, the ontologies contained in the DL profile bin are exclusive, i.e., an ontology in the
EL profile is not counted again within the DL profile. Note that, even though ontologies
in the EL, RL and QL profiles of OWL are typically in the easier bins, there are some
which are deemed hard, time out, or even result in error.
</p>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>
        There is extensive work in benchmarking reasoners, some of which focuses purely on
either classification or (conjunctive) query answering (e.g., [
        <xref ref-type="bibr" rid="ref17 ref8">8,17</xref>
        ]). Generally, previous
reasoner benchmarks used much smaller and rather ad hoc data sets, in some cases
using artificial data. For the purposes of this paper, we focus solely on work involving
the classification task, particularly using realistic rather than artificially-generated test
data.
      </p>
      <p>
        The Pellet reasoner was evaluated, in [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], with a corpus of 9 ontologies, presenting
the average of 10 independent runs of a reasoning task - the tasks under test being
consistency checking, classification and realization. Additionally, the authors compare
Pellet against FaCT++ and RacerPro in terms of classification time only, using the DL
benchmark test suite described in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The experiment showed that Pellet was not as
efficient as FaCT++ or RacerPro in many, but not all, cases.
      </p>
      <p>[Fig. 3: number of ontologies per OWL 2 profile (EL, QL, RL, DL) in each performance bin, for the Best and Worst Combo reasoners.]</p>
      <sec id="sec-5-1">
        <p>
          In [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] the authors present a system for comparing reasoners both in terms of
performance and correctness of classification results. Four reasoners are put to test: FaCT++,
Pellet, KAON2 and RacerPro, over a corpus of 172 naturally occurring ontologies, out
of which only 31 were either in or more expressive than ALC. The benchmark results
show that Pellet was the most robust reasoner, with FaCT++ a close second, being able
to process, respectively, 143 and 137 ontologies. In terms of classification time, the
authors state that “there is no clear winner”, due to considerable fluctuation of reasoner
performance across ontologies.
        </p>
        <p>
          The evaluation of the HermiT reasoner [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] was carried out against the FaCT++
and Pellet reasoners, using a corpus of ontologies derived from the Gardiner data set
[
          <xref ref-type="bibr" rid="ref6">6</xref>
], the Open Biological Ontologies (OBO) Foundry (http://obofoundry.org), and finally, several versions of
the GALEN ontology. The result was that HermiT outperforms the other reasoners in
the majority of tested ontologies.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] the authors carry out a benchmark of ontologies derived from the Watson
repository (http://watson.kmi.open.ac.uk/WatsonWUI/). Out of the 6,224 ontologies in Watson, only 3,303 were parseable by
both Swoop and the KAON2 tools. These were then classified into 4 bins according
to their expressivity; RDFS(DL), OWL DLP, OWL Lite, and OWL DL. From these
bins, the authors picked 1 representative per bin, according to its popularity in previous
benchmarks. The test itself involved the reasoners HermiT, Pellet, RacerPro, KAON2,
OWLIM and Sesame, where the classification performance results show that HermiT
was fastest in 3 out of 4 cases, with OWLIM being the fastest on the RDFS(DL) representative.
        </p>
        <p>
          The author of [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] performs a benchmark of the Pellet, FaCT++ and Racer
reasoners, though using different interfaces (FaCT++ used DIG at the time) - thus the
results are not directly comparable. This benchmark was carried out using a corpus of
135 OWL ontologies from Schemaweb (http://schemaweb.info/). The experiment showed that FaCT++ was the
fastest (excluding timeouts) and the most robust, since it processed the most ontologies
without timing out or aborting (due to errors not disclosed by the author).
        </p>
        <p>
          The benchmark carried out in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] compared the KAON2, Pellet, Racer, HermiT and
FaCT++ reasoners against 50 naturally occurring ontologies. However, in the paper, the
authors focus only on a few examples; Racer was fastest on the Wine ontology, HermiT
on DeepTree, FaCT++ on the NCI Thesaurus, and HermiT on GALEN.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Future Work</title>
      <p>In this paper, we did not have space to discuss whether there is a performance/size
or performance/expressivity correlation. By and large, our analysis shows that there
is a roughly linear correlation between performance and size, and no correlation with
expressivity.</p>
      <p>Due to the large size of the Web crawl corpus, we resorted to sampling in order to
obtain results in time. Though we have tested large enough samples to attain statistical
significance, we hope to complete processing all ontologies in said corpus in the near
future. For the purposes of this paper we limited our attention to classification, but could
easily extend our benchmarking to other inference problems, even to non-standard ones
such as justification finding. We also intend to tackle the vast task of identifying
promising correlations between features of ontologies and their reasoning difficulty.</p>
      <p>To address the difficulties in stable, cross-experiment comparison and
interpretation, we propose to establish a comprehensive benchmark which is updated yearly. To
facilitate rapid experimentation, we will provide canonical stable random samples so
that experimenters can provide a comparable baseline, even if for scientific reasons
they must also investigate other inputs. We will also make our test framework and
computing platform available, re-running all the experiments we can gather in the prior year
to provide systematic review and replication of results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baader</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brandt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lutz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Pushing the EL envelope</article-title>
          .
          <source>In: Proc. of the 19th Int. Joint Conf. on Artificial Intelligence (IJCAI-05)</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Babik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hluchy</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>A testing framework for OWL-DL reasoning</article-title>
          .
          <source>In: Proc. of the Int. Conf. on Semantics, Knowledge and Grids (SKG-08)</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bock</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haase</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Volz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Benchmarking OWL reasoners</article-title>
          .
          <source>In: Proc. of the Int. Workshop on Advancing Reasoning on the Web: Scalability and Commonsense (ARea-08)</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Giacomo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lembo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenzerini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosati</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Tractable reasoning and efficient query answering in description logics: The DL-Lite family</article-title>
          .
          <source>J. of Automated Reasoning</source>
          <volume>39</volume>
          (
          <issue>3</issue>
          ),
          <fpage>385</fpage>
          -
          <lpage>429</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cuenca Grau</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel-Schneider</surname>
            ,
            <given-names>P.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>OWL 2: The next step for OWL</article-title>
          .
          <source>J. of Web Semantics</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gardiner</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsarkov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Framework for an automated comparison of description logic reasoners</article-title>
          .
          <source>In: Proc. of the 5th Int. Semantic Web Conf. (ISWC-06)</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gonçalves</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Analysing the evolution of the NCI thesaurus</article-title>
          .
          <source>In: Proc. of the 24th IEEE Int. Symposium on Computer-Based Medical Systems (CBMS-11)</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heflin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>LUBM: A benchmark for OWL knowledge base systems</article-title>
          .
          <source>J. of Web Semantics</source>
          <volume>3</volume>
          (
          <issue>2-3</issue>
          ),
          <fpage>158</fpage>
          -
          <lpage>182</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Haarslev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Möller</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>RACER system description</article-title>
          .
          <source>In: Proc. of the 1st Int. Joint Conf. on Automated Reasoning (IJCAR-01). Lecture Notes in Artificial Intelligence</source>
          , vol.
          <source>2083</source>
          . Springer-Verlag (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Horridge</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bechhofer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>The OWL API: A Java API for working with OWL 2 ontologies</article-title>
          .
          <source>In: Proc. of the 6th Int. Workshop on OWL: Experiences and Directions (OWLED-09)</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel-Schneider</surname>
            ,
            <given-names>P.F.</given-names>
          </string-name>
          :
          <article-title>DL systems comparison</article-title>
          .
          <source>In: Proc. of the 11th Int. Workshop on Description Logics (DL-98)</source>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel-Schneider</surname>
            ,
            <given-names>P.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>From SHIQ and RDF to OWL: The making of a web ontology language</article-title>
          .
          <source>J. of Web Semantics</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>7</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kazakov</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Consequence-driven reasoning for Horn SHIQ ontologies</article-title>
          .
          <source>In: Proc. of the 21st Int. Joint Conf. on Artificial Intelligence (IJCAI-09)</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Knublauch</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fergerson</surname>
            ,
            <given-names>R.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Musen</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>The Protégé OWL plugin: An open development environment for semantic web applications</article-title>
          .
          <source>In: Proc. of the 3rd Int. Semantic Web Conf. (ISWC-04)</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Benchmarking DL reasoners using realistic ontologies</article-title>
          .
          <source>In: Proc. of the 1st Int. Workshop on OWL: Experiences and Directions (OWLED-05)</source>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>J.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Soundness preserving approximation for TBox reasoning</article-title>
          .
          <source>In: Proc. of the 24th AAAI Conf. on Artificial Intelligence (AAAI-10)</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>A comparison of reasoning techniques for querying large description logic ABoxes</article-title>
          .
          <source>In: Proc. of the 13th Int. Conf. on Logic for Programming and Automated Reasoning (LPAR-06)</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Shearer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>HermiT: A highly-efficient OWL reasoner</article-title>
          .
          <source>In: Proc. of the 5th Int. Workshop on OWL: Experiences and Directions (OWLED-08EU)</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Sirin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuenca Grau</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalyanpur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katz</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Pellet: A practical OWL-DL reasoner</article-title>
          .
          <source>J. of Web Semantics</source>
          <volume>5</volume>
          (
          <issue>2</issue>
          ),
          <fpage>51</fpage>
          -
          <lpage>53</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Tsarkov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>FaCT++ description logic reasoner: System description</article-title>
          .
          <source>In: Proc. of the 3rd Int. Joint Conf. on Automated Reasoning (IJCAR-06)</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>