The Empirical Robustness of Description Logic Classification

Rafael S. Gonçalves, Nicolas Matentzoglu, Bijan Parsia, and Uli Sattler

School of Computer Science, University of Manchester, Manchester, United Kingdom

Abstract. In spite of the recent renaissance in lightweight description logics (DLs), many prominent DLs, such as that underlying the Web Ontology Language (OWL), have high worst-case complexity for their key inference services. Modern reasoners have a large array of optimizations, tuned calculi, and implementation tricks that allow them to perform very well in a variety of application scenarios, even though the complexity results ensure that they will perform poorly on some inputs. For users, the key question is how often they will encounter those pathological inputs in practice, that is, how robust reasoners are. We attempt to answer this question for classification of existing ontologies as they are found on the Web. It is fairly common for users to examine ontologies published on the Web as part of their development process; thus, the robustness of reasoners in this scenario is both directly interesting and provides some hints toward answering the broader question. From our experiments, we show that the current crop of OWL reasoners, in collaboration, is very robust against the Web.

1 Motivation

A serious concern about both versions 1 [12] and 2 [5] of the Web Ontology Language (OWL) is that the underlying description logics (SHOIQ and SROIQ) exhibit extremely bad worst-case complexity (NEXPTIME and 2NEXPTIME) for their key inference services. While highly optimized description logic reasoners have exhibited rather good performance on real cases since the mid-1990s, even in those more constrained cases there are ontologies (such as Galen) which have proved impossible to process for over a decade. Indeed, concern with such pathology stimulated a renaissance of research into tractable description logics, with the EL family [1] and the DL-Lite family [4] being incorporated as special "profiles" of OWL 2. However, even though the number of ontologies available on the Web has grown enormously since the standardization of OWL, it is still unclear how robust modern, highly optimized reasoners are to such input. Anecdotal evidence suggests that pathological cases are common enough to cause problems; systematic evidence, however, has been scarce.

In this paper we investigate the question of whether modern, highly-optimized description logic reasoners are robust over Web input. The general intuition of a robust system is that it is resistant to failure in the face of a range of input. For any particular robustness determination, one must decide: 1) the range of input, 2) the functional or non-functional properties of interest, and 3) what counts as failure. The instantiation of these parameters strongly influences robustness judgements, with the very same reasoner being highly robust under one scenario and very non-robust under another. For our current purposes, the key scenario is that an ontology engineer, using a tool like Protégé [14], is inspecting ontologies published on the Web with an eye to possible reuse, and, as is common, they wish to classify the ontology using a standard OWL 2 DL reasoner as part of their evaluation.
This scenario yields the following constraints: 1) for input, we examine Web-based corpora; 2) for functional properties, acceptance (will the reasoner load and process the ontology), and for non-functional properties, performance (i.e., will the reasoner complete classification before the ontology engineer gives up); 3) w.r.t. acceptance, failure means either rejecting the input or crashing while processing, and we might reasonably expect an engineer to wait up to 2 hours if the ontology seems "worth it". If a reasoner (or a set of reasoners) is successful for 90% of a corpus, we count that reasoner as robust over that corpus, with 95% and 99% indicating "strong" and "extreme" robustness. While these levels are clearly arbitrary (as is the timeout), they provide a framework to set expectations. Robustness under these assumptions does not ensure robustness under other assumptions (e.g., over subsets of these ontologies as experienced during development, or over a more stringent time constraint), yet they are challenging enough that it was unclear to us ex ante whether any reasoners would be robust for any corpus. In fact, we find that the reasoners are robust or near robust for most of the cases we examine, including for lower timeouts. More significantly, if we take the best result for each ontology (which represents a kind of "meta-reasoner", where our test reasoners are run in parallel), then the set of reasoners is extremely robust over all corpora. Thus, in a fairly precise, if limited, sense, we demonstrate that SHOIQ and SROIQ are practical description logics.

2 Materials & Methods

For our input data, we gathered three sets of ontologies from the Web, each with fundamentally different characteristics: all versions of the NCI Thesaurus (NCIt), the ontologies in the NCBO BioPortal repository, and the results of a Web crawl.

The NCIt has been continuously developed and published in monthly versions since 2003. The NCIt archive (http://evs.nci.nih.gov/ftp1/NCI_Thesaurus) contains 106 versions parseable by the OWL API [10] (http://owlapi.sourceforge.net), from release 02.00 (October 2003) through to release 12.11d (November 2012), ranging in size from 49,475 to 133,900 logical axioms and in expressivity from ALE to SH(D). The NCIt team is a fairly stable, closed team of about 20 ontology developers who use a highly regimented process and, at least since 2006, have incorporated OWL reasoners into their tool chain (namely FaCT++ and Pellet). The NCIt is large and easily accessible, and has thus been an informal benchmark for reasoner developers. Additionally, the NCI has funded various infrastructure projects, including improvements to reasoners. Thus, we might reasonably expect that reasoners are robust w.r.t. this corpus, both because the NCIt team may be tuning their ontology to the available reasoners (though the fact that they fund improvements suggests not), and because reasoner developers are tuning for the NCIt.

The NCBO BioPortal is a Web-based repository for health care and life science ontologies. We use a snapshot of (publicly downloadable ontologies from) the BioPortal repository from November 2012, consisting of 292 OWL- and OBO-parseable ontologies. The average number of logical axioms in the corpus is 28,439 (total: 8,190,504; median: 979 axioms), and 89 of these ontologies contain named individuals. Four ontologies contained no logical axioms at all and thus were discarded. In expressivity, the ontologies range from the inexpressive DL AL to the very expressive SROIQ.
The ontologies are developed and used in a wide range of largely unrelated projects, for a variety of purposes, using a variety of tools. While BioPortal has received some attention from the research community, it is not yet a standard target for reasoner developers.

The third corpus, obtained by a short Web crawl fuelled by a high number of seeds from Swoogle, Google, and ontology repositories on the Web, was collected in November 2012. We picked a random sample of 822 ontologies, out of which 145 contained no logical axioms at all and thus were discarded, leaving 677 ontologies for our experiment. The average number of logical axioms is 2,405 (total: 1,628,207; median: 57), and the expressivity ranges from AL to SRIQ. These ontologies span a wide range of subjects and are completely uncontrolled with respect to their origin. Perhaps not surprisingly, there are fewer axioms overall and on average, with half of the ontologies containing under 60 axioms. This may reflect less commitment to these ontologies than we see in the more curated sets. However, there is no reason to think that the reasoners have been specially tuned to these ontologies and, given the worst-case complexity of the logics, even small ontologies are potential pathological cases. Thus, it is unclear what the rational robustness expectation is for this set.

We selected four reasoners for testing based on the following criteria: a) coverage of all of OWL 2, b) freely available for download, c) native support for the OWL API, and finally d) based on sound, complete, and terminating algorithms. As such, the chosen reasoners are Pellet [19], HermiT [18], FaCT++ [20], and JFact. We excluded, e.g., the RacerPro [9], CB [13], and KAON2 (http://kaon2.semanticweb.org) reasoners due to their lack of coverage of all OWL 2 features and their lack of native support for the OWL API. Finally, we did not consider approximate (either unsound or incomplete) reasoners, such as TrOWL [16], so that we can compare classification results between reasoners, and because we feel that approximation is generally only considered in cases where sound and complete reasoners fail. For all our experiments we use the current 2013 reasoner versions, namely Pellet v2.3.0, HermiT v1.3.6, FaCT++ v1.6.1, and JFact v1.0. However, since NCIt performance has been studied previously [7], we decided to compare the current reasoner versions with the versions used in the 2011 study, namely Pellet v2.2.2, HermiT v1.3.3, FaCT++ v1.5.3, and JFact v0.2, in order to test how much tuning to the NCIt occurs.

As mentioned earlier, we set the classification timeout to 2 hours per ontology-reasoner pair. From a scenario perspective, 2 hours is rather generous: many ontologists will give up much sooner than that. However, 2 hours gives us an idea of which "hard" ontologies are clearly within striking distance, without making completing the experiments infeasible. In the presentation below, we also examine a tighter timeout (and thus a harder robustness criterion) of about 100 seconds. The main experiment machine has an Intel Quad-Core Xeon 3.2GHz processor with 32GB DDR3 RAM. A second experiment involving solely the NCIt (and both reasoner sets) was performed on a machine with an Intel Dual-Core i7 2.7GHz processor and 16GB DDR3 RAM. All tests were run on Mac OS X 10.7.5, using Java v1.7 and the OWL API v3.4.1.
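To make the measurement setup concrete, the following Java sketch shows how a single ontology-reasoner pair can be classified under an external wall-clock timeout via the OWL API. This is a simplification offered for illustration only: the class and constant names are ours, and the actual experiment harness differs in detail.

import java.io.File;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.reasoner.InferenceType;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import org.semanticweb.owlapi.reasoner.OWLReasonerFactory;

public class ClassificationTimer {

    private static final long TIMEOUT_MS = 2 * 60 * 60 * 1000L; // the 2-hour budget

    /** Classifies the ontology with the given reasoner; returns wall-clock
     *  time in milliseconds, or -1 if the timeout was hit. */
    public static long classify(final OWLReasonerFactory factory, File ontologyFile)
            throws Exception {
        final OWLOntology ont = OWLManager.createOWLOntologyManager()
                .loadOntologyFromOntologyDocument(ontologyFile);
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Future<Long> run = exec.submit(new Callable<Long>() {
            public Long call() {
                long start = System.currentTimeMillis();
                OWLReasoner reasoner = factory.createReasoner(ont);
                // Classification amounts to precomputing the inferred class hierarchy
                reasoner.precomputeInferences(InferenceType.CLASS_HIERARCHY);
                reasoner.dispose();
                return System.currentTimeMillis() - start;
            }
        });
        try {
            return run.get(TIMEOUT_MS, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            run.cancel(true); // best effort; a stuck reasoner may need a process kill
            return -1L;
        } finally {
            exec.shutdownNow();
        }
    }
}

Any of the tested reasoners can be plugged in through its OWLReasonerFactory implementation (e.g., HermiT ships one as org.semanticweb.HermiT.Reasoner.ReasonerFactory).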
The test corpora, experiment results, and reasoners used are available from http://sites.google.com/site/reasonerbenchmark (the full set of crawled ontologies is available upon request).

3 Results

In all experiments we categorise ontology classification times into the following bins: Very Easy (≤ 1 second), Easy (1-10 seconds), Medium (10-100 seconds), Hard (100-1000 seconds), and Very Hard (> 1000 seconds). We define "Impatient Robustness" as a measure of how many ontologies terminate in an acceptable time for most users, i.e., ontologies in the Medium bin or below; a small code sketch pinning down these definitions is given at the end of this subsection. Throughout this section we use "Best Combo" for the best of all four reasoners' results on an ontology (i.e., the fastest time) and, analogously, "Worst Combo" for the worst.

3.1 NCI Thesaurus

In this experiment we test both the 2011 and 2013 reasoner version sets, and compare the performance behaviour of each reasoner. The classification times for both reasoner sets are shown in Figure 1. Using the reasoner versions from 2011, and taking into account those ontologies that all reasoners managed to process and classify, FaCT++ is on average the fastest of all four reasoners, taking 14.7 seconds per version. JFact comes second, with an average of 22.9 seconds per version, while Pellet is the third fastest, taking on average 36.5 seconds per version; finally, HermiT is the slowest, with 150 seconds per version (see Table 1).

When switching to the 2013 reasoner set, the performance winner remains FaCT++, with an average of 19.2 seconds per ontology. However, in second place now comes Pellet, taking 61.6 seconds on average, in third HermiT, with an average of 174 seconds, and finally JFact, taking 180 seconds on average per ontology. Notice that from 2011 to 2013 there was a significant improvement in JFact's robustness, with far fewer errors. Similarly, Pellet has fewer errors in its most recent version, and its performance is superior to that of the 2011 version. The performance of FaCT++ and HermiT decreased slightly from 2011 to 2013 on this corpus, though not nearly as much as that of JFact; the latter is possibly because, in its most recent version, JFact is able to process the more recent versions of the NCIt.

Overall, FaCT++ is the most robust reasoner for the NCIt corpus, having no errors with either the 2011 or the 2013 version (see Tables 1 and 2). Furthermore, it is the fastest performing reasoner across nearly all versions. The least robust is, interestingly, FaCT++'s port to the Java language, JFact, due to the high number of errors reported. Though there is improvement from 2011 to 2013, this "young" reasoner is still not as fast as FaCT++. The reasoner errors encountered throughout the NCIt were "OutOfMemory" errors from Pellet, "StackOverflow" errors from HermiT, and "IllegalArgument" errors from JFact.

Fig. 1. Comparison of classification times between the 2011 reasoner version set (suffixed '11) and the 2013 set (suffixed '13) over the NCIt corpus (y-axis: time in seconds; x-axis: NCIt version, v1 to v106).
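Before turning to the binned results (Tables 1 and 2 below), we pin down the bin and robustness definitions with a small illustrative sketch. The names and the inclusive handling of bin boundaries are our own choices here, not prescribed by the experiment framework.

public class PerformanceBins {

    /** Maps a classification time (in seconds) to the performance bins of Section 3. */
    public static String bin(double seconds) {
        if (seconds <= 1)    return "Very Easy";
        if (seconds <= 10)   return "Easy";
        if (seconds <= 100)  return "Medium"; // Impatient Robustness counts up to here
        if (seconds <= 1000) return "Hard";
        return "Very Hard";                   // still within the 2-hour timeout
    }

    /** Robustness as a percentage of the corpus. For Overall Robustness,
     *  successes are all runs that neither errored nor timed out; for
     *  Impatient Robustness, only runs landing in the Medium bin or below. */
    public static double robustness(int successes, int corpusSize) {
        return 100.0 * successes / corpusSize;
    }
}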
Bin                  | Pellet   | HermiT   | JFact    | FaCT++   | Best Combo | Worst Combo
Very Easy            | 0 (0%)   | 0 (0%)   | 0 (0%)   | 0 (0%)   | 0 (0%)     | 0 (0%)
Easy                 | 16 (15%) | 15 (14%) | 16 (15%) | 18 (17%) | 18 (17%)   | 15 (14%)
Medium               | 70 (66%) | 42 (40%) | 52 (49%) | 88 (83%) | 88 (83%)   | 15 (14%)
Hard                 | 18 (17%) | 48 (45%) | 0 (0%)   | 0 (0%)   | 0 (0%)     | 37 (35%)
Very Hard            | 0 (0%)   | 0 (0%)   | 0 (0%)   | 0 (0%)   | 0 (0%)     | 0 (0%)
Timeout              | 0 (0%)   | 0 (0%)   | 0 (0%)   | 0 (0%)   | 0 (0%)     | 0 (0%)
Errors               | 2 (2%)   | 1 (1%)   | 38 (36%) | 0 (0%)   | 0 (0%)     | 39 (37%)
Impatient Robustness | 81%      | 54%      | 64%      | 100%     | 100%       | 28%
Overall Robustness   | 98%      | 99%      | 64%      | 100%     | 100%       | 63%

Table 1. Binning of the NCIt corpus according to performance (2011 reasoner version set).

Bin                  | Pellet   | HermiT   | JFact    | FaCT++   | Best Combo | Worst Combo
Very Easy            | 0 (0%)   | 0 (0%)   | 0 (0%)   | 0 (0%)   | 0 (0%)     | 0 (0%)
Easy                 | 16 (15%) | 15 (14%) | 0 (0%)   | 19 (18%) | 19 (18%)   | 0 (0%)
Medium               | 71 (67%) | 42 (40%) | 24 (23%) | 87 (82%) | 87 (82%)   | 23 (22%)
Hard                 | 19 (18%) | 48 (45%) | 70 (66%) | 0 (0%)   | 0 (0%)     | 70 (66%)
Very Hard            | 0 (0%)   | 0 (0%)   | 0 (0%)   | 0 (0%)   | 0 (0%)     | 0 (0%)
Timeout              | 0 (0%)   | 0 (0%)   | 0 (0%)   | 0 (0%)   | 0 (0%)     | 0 (0%)
Errors               | 0 (0%)   | 1 (1%)   | 12 (11%) | 0 (0%)   | 0 (0%)     | 13 (12%)
Impatient Robustness | 82%      | 54%      | 23%      | 100%     | 100%       | 22%
Overall Robustness   | 100%     | 99%      | 89%      | 100%     | 100%       | 88%

Table 2. Binning of the NCIt corpus according to performance (2013 reasoner version set).

3.2 NCBO BioPortal

In the snapshot of BioPortal, out of all 288 non-empty ontologies, 9 are inconsistent and 234 are classified by all reasoners within the timeout (see Table 3). Among those ontologies for which all reasoners completed classification, FaCT++ was on average the fastest (2.9 seconds per ontology), followed by JFact (5.9 seconds), HermiT (9.8 seconds), and finally Pellet (16.7 seconds). However, in terms of robustness, our results show that Pellet is the most robust of all the reasoners, failing to handle only 9 ontologies, while HermiT, the least robust, fails to classify 27 ontologies. FaCT++ and JFact fail to handle 24 and 25 ontologies, respectively (see Table 4 for more details regarding errors).

Generally, Pellet is not only the most robust reasoner for BioPortal, with the fewest errors, but it also exhibits fast performance on a high number of ontologies. It does, however, have the most timeouts, though note that some of these occurred on ontologies on which other reasoners threw an error. The remaining three reasoners are very close to each other in both performance and robustness; HermiT has fewer timeouts but more errors than JFact and FaCT++, as well as slower performance. Thus HermiT is the least robust reasoner for BioPortal.

Bin                  | Pellet    | HermiT    | JFact     | FaCT++    | Best Combo | Worst Combo
Very Easy            | 190 (66%) | 170 (59%) | 184 (64%) | 218 (76%) | 236 (82%)  | 152 (53%)
Easy                 | 56 (19%)  | 61 (21%)  | 58 (20%)  | 24 (8%)   | 28 (10%)   | 58 (20%)
Medium               | 10 (3%)   | 15 (5%)   | 8 (3%)    | 7 (2%)    | 11 (4%)    | 10 (3%)
Hard                 | 4 (1%)    | 4 (1%)    | 2 (1%)    | 2 (1%)    | 4 (1%)     | 2 (1%)
Very Hard            | 6 (2%)    | 3 (1%)    | 0 (0%)    | 3 (1%)    | 4 (1%)     | 2 (1%)
Timeout              | 13 (5%)   | 8 (3%)    | 11 (4%)   | 10 (3%)   | 5 (2%)     | 15 (5%)
Errors               | 9 (3%)    | 27 (9%)   | 25 (9%)   | 24 (8%)   | 0 (0%)     | 49 (17%)
Impatient Robustness | 89%       | 85%       | 87%       | 86%       | 95%        | 76%
Overall Robustness   | 92%       | 88%       | 88%       | 88%       | 98%        | 78%

Table 3. Binning of the BioPortal corpus according to performance.

Error                  | Pellet | HermiT | JFact | FaCT++
StackOverflow          | 2      | 0      | 1     | 0
OutOfMemory            | 1      | 1      | 2     | 0
UnsupportedDatatype    | 0      | 13     | 4     | 14
InternalReasoner       | 2      | 0      | 1     | 0
IllegalArgument        | 0      | 12     | 16    | 6
MalformedLiteral       | 0      | 1      | 0     | 0
ConcurrentModification | 3      | 0      | 0     | 0
Reasoner crashed       | 0      | 0      | 0     | 4
IndexOutOfBounds       | 1      | 0      | 1     | 0
Total Errors           | 9      | 27     | 25    | 24

Table 4. Errors and exceptions that occurred during classification of BioPortal ontologies.
3.3 Web Crawl Corpus

Out of the 677 non-empty ontologies from the Web crawl corpus, all reasoners completed classification of 560. On these 560, Pellet was the fastest reasoner on average (0.5 seconds per ontology), followed by FaCT++ (1.5 seconds), HermiT (3.1 seconds), and finally JFact (6.2 seconds). In terms of robustness, Pellet is, again, the most robust, having thrown errors on only 17 ontologies (see Table 5). It is also the reasoner with the most timeouts, but, again, several of these occurred on ontologies on which other reasoners threw errors. FaCT++ and HermiT both have a high number of errors, while, curiously, JFact did much better on that front in this corpus. The errors found across the corpus are broken down in Table 6.

Bin                  | Pellet    | HermiT    | JFact     | FaCT++    | Best Combo | Worst Combo
Very Easy            | 597 (88%) | 536 (79%) | 557 (82%) | 566 (84%) | 642 (95%)  | 493 (73%)
Easy                 | 44 (6%)   | 36 (5%)   | 45 (7%)   | 12 (2%)   | 26 (4%)    | 44 (6%)
Medium               | 2 (0%)    | 8 (1%)    | 11 (2%)   | 0 (0%)    | 3 (0%)     | 12 (2%)
Hard                 | 1 (0%)    | 1 (0%)    | 4 (1%)    | 5 (1%)    | 2 (0%)     | 3 (0%)
Very Hard            | 0 (0%)    | 1 (0%)    | 1 (0%)    | 1 (0%)    | 0 (0%)     | 1 (0%)
Timeout              | 16 (2%)   | 6 (1%)    | 5 (1%)    | 5 (1%)    | 4 (1%)     | 10 (1%)
Reasoner Errors      | 17 (3%)   | 89 (13%)  | 54 (8%)   | 88 (13%)  | 0 (0%)     | 114 (17%)
Impatient Robustness | 95%       | 86%       | 91%       | 85%       | 99%        | 81%
Overall Robustness   | 95%       | 86%       | 91%       | 86%       | 99%        | 82%

Table 5. Binning of the Web crawl corpus according to performance.

Error               | Pellet | HermiT | JFact | FaCT++
StackOverflow       | 13     | 0      | 0     | 0
OutOfMemory         | 2      | 0      | 2     | 0
NullPointer         | 0      | 0      | 36    | 0
UnloadableImport    | 0      | 1      | 1     | 1
ClassCast           | 0      | 0      | 1     | 0
UnsupportedDatatype | 0      | 81     | 1     | 86
Datatype constraint | 2      | 0      | 0     | 0
IllegalArgument     | 0      | 3      | 5     | 0
MalformedLiteral    | 0      | 2      | 0     | 0
ReasonerInternal    | 0      | 0      | 8     | 1
UnsupportedFacet    | 0      | 2      | 0     | 0
Total               | 17     | 89     | 54    | 88

Table 6. Errors and exceptions that occurred during classification of the Web crawl ontologies.

Overall, Pellet is the most robust reasoner for this corpus, and the fastest (among ontologies that all reasoners could classify), followed closely by JFact in terms of both robustness and performance. The least robust reasoners for the Web crawl corpus are FaCT++ and HermiT, with 88 and 89 errors, respectively. However, HermiT performed slightly better on the lower bins, while FaCT++ was clearly the slowest in this corpus.

4 Discussion

Overall, we have processed a total of 1,071 ontologies, the largest such reasoner benchmark to date, and found that, amongst the four tested reasoners, Pellet is the most robust of all (see Table 7). Surprisingly, Pellet is followed by JFact in our robustness test, due to JFact having far fewer errors than FaCT++. HermiT and FaCT++ have the same overall robustness, but FaCT++ has fewer errors and higher impatient robustness.

Bin                  | Pellet    | HermiT    | JFact     | FaCT++    | Best Combo | Worst Combo
Very Easy            | 787 (73%) | 706 (66%) | 741 (69%) | 784 (73%) | 878 (82%)  | 645 (60.2%)
Easy                 | 116 (11%) | 112 (10%) | 103 (10%) | 55 (5%)   | 73 (7%)    | 102 (9.5%)
Medium               | 83 (8%)   | 65 (6%)   | 43 (4%)   | 94 (9%)   | 101 (9%)   | 45 (4.2%)
Hard                 | 24 (2%)   | 53 (5%)   | 76 (7%)   | 7 (1%)    | 6 (1%)     | 75 (7.0%)
Very Hard            | 6 (1%)    | 4 (0%)    | 1 (0%)    | 4 (0%)    | 4 (0%)     | 3 (0.3%)
Timeout              | 29 (3%)   | 14 (1%)   | 16 (1%)   | 15 (1%)   | 9 (1%)     | 25 (2.3%)
Errors               | 26 (2%)   | 117 (11%) | 91 (8%)   | 112 (10%) | 0 (0%)     | 176 (16.4%)
Total (excl. Errors) | 1016      | 940       | 964       | 944       | 1062       | 870
Total (incl. Errors) | 1071      | 1071      | 1071      | 1071      | 1071       | 1071
Impatient Robustness | 92%       | 82% [90%] | 83%       | 87% [96%] | 98%        | 74% [87%]
Overall Robustness   | 95%       | 88% [96%] | 90%       | 88% [97%] | 99%        | 81% [96%]

Table 7. Binning of all three corpora: BioPortal, NCIt (2013), and Web crawl. In the robustness rows, values in square brackets indicate robustness w.r.t. OWL 2 alone.

While Pellet is the most robust reasoner, we urge some caution in that reading.
In particular, this does not mean that Pellet will always do best or even perform reasonably; in fact, it may time out where other reasoners finish reasonably fast. The set of reasoners (taken together, and considering the best results) is extremely robust across the board (for each reasoner's contribution to the best case reasoner, see Figure 2). Thus, we have strong empirical evidence that the ontologies on the Web do not supply many in-principle intractable cases, but only cases which are difficult for particular reasoners.

Corpus    | Pellet | HermiT | JFact | FaCT++
Web Crawl | 203    | 160    | 193   | 708
NCIt 2013 | 2      | 0      | 0     | 104
BioPortal | 21     | 18     | 24    | 94

Fig. 2. Number of times that each reasoner equals the best case, for each corpus.

Note that FaCT++ and JFact fail to process several ontologies due to poor support for OWL datatypes, particularly datatypes not specified in the OWL 2 datatype map; both of these reasoners, as well as HermiT, have little support for OWL 1 datatypes. If we remove the non-OWL 2 datatype errors, FaCT++ ends up the most robust reasoner w.r.t. OWL 2, followed by HermiT and Pellet. From Figure 2 we see that FaCT++ outperforms the other reasoners on many occasions but, due to the high number of errors thrown, its robustness w.r.t. our input data is not nearly at the same level as its performance.

The 9 ontologies which no reasoner classified within the timeout range in expressivity from ALEHIF+ to SRIQ. Their average number of logical axioms is 56,179; the minimum is 341 axioms (a SRIQ ontology), the maximum 379,734 axioms (a SR ontology), and the median 17,385 axioms (a SHIF ontology).

It is clear that deriving a sensible ranking, even simply using average or total time, is not straightforward. Our results have rather strong implications for reasoner experiments, especially those purporting to show the advantages of an optimisation, a technique, or an implementation: the space is very complex, and it is very easy to generate a sample that is simultaneously biased for one system and against another. Even simple, seemingly innocuous things like timeouts and classification failures require tremendous care in handling. If results are going to be meaningful across papers, we need to converge on experimental inputs, methods, and reporting forms.

Finally, in order to get an overall picture of how these robustness measurements relate to the OWL profiles that ontologies fall into, we divide our ontologies by their corresponding OWL profile and match them with the observed performance bin of the Best and Worst Combo reasoners. This is displayed in Figure 3. Since there is overlap between the EL, RL, and QL profiles of OWL 2, some ontologies are counted in more than one such bin, meaning that the total number of ontologies in Figure 3 does not add up to the number of ontologies in our corpus. However, the ontologies contained in the DL profile bin are exclusive, i.e., an ontology in the EL profile is not counted again within the DL profile. Note that, even though ontologies in the EL, RL, and QL profiles of OWL are typically in the easier bins, there are some which are deemed hard, time out, or even result in error.

Fig. 3. Number of ontologies in each OWL 2 profile (EL, QL, RL, DL) displayed according to the performance profile of the Best and Worst Combo reasoners: (a) Best Combo on the 'Very Easy' bin, (b) Best Combo on the remaining bins (Easy, Medium, Hard, Very Hard, Timeout, Error), (c) Worst Combo on the 'Very Easy' bin, (d) Worst Combo on the remaining bins. The left-hand side (Figures 3(a) and 3(c)) shows the OWL profile distribution of the ontologies in the 'Very Easy' performance bin, as it is the most densely populated bin; the right-hand side (Figures 3(b) and 3(d)) shows the remaining performance bins.
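The "Best Combo" results above can be read as a simple parallel portfolio: run all four reasoners on the same input and take the first to finish. In our experiments the reasoners were in fact run individually and the best result selected afterwards, but the idea admits a direct sketch, given below with illustrative names. Note that, since OWL API ontologies are not thread-safe, each reasoner should really be given its own copy of the ontology, which we elide here for brevity.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.reasoner.InferenceType;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import org.semanticweb.owlapi.reasoner.OWLReasonerFactory;

public class BestCombo {

    /** Returns the name of the first reasoner to classify the ontology;
     *  throws if none succeeds within the timeout. */
    public static String classifyWithBest(List<OWLReasonerFactory> factories,
            final OWLOntology ont, long timeoutMs) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(factories.size());
        List<Callable<String>> tasks = new ArrayList<Callable<String>>();
        for (final OWLReasonerFactory f : factories) {
            tasks.add(new Callable<String>() {
                public String call() {
                    OWLReasoner r = f.createReasoner(ont);
                    r.precomputeInferences(InferenceType.CLASS_HIERARCHY);
                    r.dispose();
                    return f.getReasonerName();
                }
            });
        }
        try {
            // invokeAny returns the result of the first task to complete
            // successfully and cancels the rest, so individual reasoner
            // errors are tolerated as long as at least one succeeds.
            return exec.invokeAny(tasks, timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            exec.shutdownNow();
        }
    }
}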
5 Related Work

There is extensive work in benchmarking reasoners, some of which focuses purely on either classification or (conjunctive) query answering (e.g., [8,17]). Generally, previous reasoner benchmarks used much smaller and rather ad hoc data sets, in some cases using artificial data. For the purposes of this paper, we focus solely on work involving the classification task, particularly that using realistic rather than artificially-generated test data.

The Pellet reasoner was evaluated, in [19], with a corpus of 9 ontologies, presenting the average of 10 independent runs of a reasoning task, the tasks under test being consistency checking, classification, and realization. Additionally, the authors compare Pellet against FaCT++ and RacerPro in terms of classification time only, using the DL benchmark test suite described in [11]. The experiment showed that Pellet was not as efficient as FaCT++ or RacerPro in many, but not all, cases.

In [6] the authors present a system for comparing reasoners both in terms of performance and correctness of classification results. Four reasoners are put to the test: FaCT++, Pellet, KAON2, and RacerPro, over a corpus of 172 naturally occurring ontologies, only 31 of which were in ALC or a more expressive logic. The benchmark results show that Pellet was the most robust reasoner, with FaCT++ a close second, being able to process, respectively, 143 and 137 ontologies. In terms of classification time, the authors state that "there is no clear winner", due to considerable fluctuation of reasoner performance across ontologies.

The evaluation of the HermiT reasoner [18] was carried out against the FaCT++ and Pellet reasoners, using a corpus of ontologies derived from the Gardiner data set [6], the Open Biological Ontologies (OBO) Foundry (http://obofoundry.org), and several versions of the GALEN ontology. The result was that HermiT outperforms the other reasoners on the majority of tested ontologies.

In [3] the authors carry out a benchmark of ontologies derived from the Watson repository (http://watson.kmi.open.ac.uk/WatsonWUI/). Out of the 6,224 ontologies in Watson, only 3,303 were parseable by both Swoop and the KAON2 tools. These were then classified into 4 bins according to their expressivity: RDFS(DL), OWL DLP, OWL Lite, and OWL DL. From these bins, the authors picked 1 representative per bin, according to its popularity in previous benchmarks.
The test itself involved the reasoners HermiT, Pellet, RacerPro, KAON2, OWLIM, and Sesame; the classification performance results show that HermiT was the fastest in 3 out of 4 cases, with OWLIM being the fastest on the RDFS(DL) representative.

The author of [15] performs a benchmark of the Pellet, FaCT++, and Racer reasoners, though using different interfaces (FaCT++ used DIG at the time), thus the results are not directly comparable. This benchmark was carried out using a corpus of 135 OWL ontologies from SchemaWeb (http://schemaweb.info/). The experiment showed that FaCT++ was the fastest (excluding timeouts) and the most robust, since it processed the most ontologies without timing out or aborting (due to errors not disclosed by the author).

The benchmark carried out in [2] compared the KAON2, Pellet, Racer, HermiT, and FaCT++ reasoners against 50 naturally occurring ontologies. However, in the paper, the authors focus only on a few examples: Racer was fastest on the Wine ontology, HermiT on DeepTree, FaCT++ on the NCI Thesaurus, and HermiT on GALEN.

6 Future Work

In this paper, we did not have space to discuss whether there is a performance/size or performance/expressivity correlation. By and large, our analysis shows that there is a roughly linear correlation between performance and size, and no correlation with expressivity.

Due to the large size of the Web crawl corpus, we resorted to sampling in order to obtain results in time. Though we have tested large enough samples to attain statistical significance, we hope to complete processing all ontologies in said corpus in the near future.

For the purposes of this paper we limited our attention to classification, but we could easily extend our benchmarking to other inference problems, even to non-standard ones such as justification finding. We also intend to tackle the vast task of identifying promising correlations between features of ontologies and their reasoning difficulty.

To address the difficulties in stable, cross-experiment comparison and interpretation, we propose to establish a comprehensive benchmark which is updated yearly. To facilitate rapid experimentation, we will provide canonical stable random samples so that experimenters can provide a comparable baseline, even if for scientific reasons they must also investigate other inputs. We will also make our test framework and computing platform available, re-running all the experiments we can gather in the prior year to provide systematic review and replication of results.

References

1. Baader, F., Brandt, S., Lutz, C.: Pushing the EL envelope. In: Proc. of the 19th Int. Joint Conf. on Artificial Intelligence (IJCAI-05) (2005)
2. Babik, M., Hluchy, L.: A testing framework for OWL-DL reasoning. In: Proc. of the Int. Conf. on Semantics, Knowledge and Grids (SKG-08) (2008)
3. Bock, J., Haase, P., Ji, Q., Volz, R.: Benchmarking OWL reasoners. In: Proc. of the Int. Workshop on Advancing Reasoning on the Web: Scalability and Commonsense (ARea-08) (2008)
4. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Rosati, R.: Tractable reasoning and efficient query answering in description logics: The DL-Lite family. J. of Automated Reasoning 39(3), 385–429 (2007)
5. Cuenca Grau, B., Horrocks, I., Motik, B., Parsia, B., Patel-Schneider, P.F., Sattler, U.: OWL 2: The next step for OWL. J. of Web Semantics (2008)
6. Gardiner, T., Tsarkov, D., Horrocks, I.: Framework for an automated comparison of description logic reasoners. In: Proc.
of the 5th Int. Semantic Web Conf. (ISWC-06) (2006)
7. Gonçalves, R.S., Parsia, B., Sattler, U.: Analysing the evolution of the NCI Thesaurus. In: Proc. of the 24th IEEE Int. Symposium on Computer-Based Medical Systems (CBMS-11) (2011)
8. Guo, Y., Pan, Z., Heflin, J.: LUBM: A benchmark for OWL knowledge base systems. J. of Web Semantics 3(2-3), 158–182 (2005)
9. Haarslev, V., Möller, R.: RACER system description. In: Proc. of the 1st Int. Joint Conf. on Automated Reasoning (IJCAR-01). Lecture Notes in Artificial Intelligence, vol. 2083. Springer-Verlag (2001)
10. Horridge, M., Bechhofer, S.: The OWL API: A Java API for working with OWL 2 ontologies. In: Proc. of the 6th Int. Workshop on OWL: Experiences and Directions (OWLED-09) (2009)
11. Horrocks, I., Patel-Schneider, P.F.: DL systems comparison. In: Proc. of the 11th Int. Workshop on Description Logics (DL-98) (1998)
12. Horrocks, I., Patel-Schneider, P.F., van Harmelen, F.: From SHIQ and RDF to OWL: The making of a web ontology language. J. of Web Semantics 1(1), 7–26 (2003)
13. Kazakov, Y.: Consequence-driven reasoning for Horn SHIQ ontologies. In: Proc. of the 21st Int. Joint Conf. on Artificial Intelligence (IJCAI-09) (2009)
14. Knublauch, H., Fergerson, R.W., Noy, N.F., Musen, M.A.: The Protégé OWL plugin: An open development environment for semantic web applications. In: Proc. of the 3rd Int. Semantic Web Conf. (ISWC-04) (2004)
15. Pan, Z.: Benchmarking DL reasoners using realistic ontologies. In: Proc. of the 1st Int. Workshop on OWL: Experiences and Directions (OWLED-05) (2005)
16. Ren, Y., Pan, J.Z., Zhao, Y.: Soundness preserving approximation for TBox reasoning. In: Proc. of the 24th AAAI Conf. on Artificial Intelligence (AAAI-10) (2010)
17. Sattler, U., Motik, B.: A comparison of reasoning techniques for querying large description logic ABoxes. In: Proc. of the 13th Int. Conf. on Logic for Programming and Automated Reasoning (LPAR-06) (2006)
18. Shearer, R., Motik, B., Horrocks, I.: HermiT: A highly-efficient OWL reasoner. In: Proc. of the 5th Int. Workshop on OWL: Experiences and Directions (OWLED-08EU) (2008)
19. Sirin, E., Parsia, B., Cuenca Grau, B., Kalyanpur, A., Katz, Y.: Pellet: A practical OWL-DL reasoner. J. of Web Semantics 5(2), 51–53 (2007)
20. Tsarkov, D., Horrocks, I.: FaCT++ description logic reasoner: System description. In: Proc. of the 3rd Int. Joint Conf. on Automated Reasoning (IJCAR-06) (2006)