=Paper= {{Paper |id=Vol-350/paper-1 |storemode=property |title=Benchmarking OWL Reasoners |pdfUrl=https://ceur-ws.org/Vol-350/paper1.pdf |volume=Vol-350 }} ==Benchmarking OWL Reasoners== https://ceur-ws.org/Vol-350/paper1.pdf
                 Benchmarking OWL Reasoners

        Jürgen Bock (1), Peter Haase (2), Qiu Ji (2), Raphael Volz (1)
     (1) FZI Research Center for Information Technologies, Karlsruhe, Germany
            (2) Institute AIFB, University of Karlsruhe, Germany
             {bock,volz}@fzi.de, {haase,qiji}@aifb.uni-karlsruhe.de



      Abstract. The growing popularity of semantic applications makes scal-
      ability of ontology reasoning tasks increasingly important. In this work,
      we first analyze the ontology landscape on the web, and identify typical
      clusters of expressivity. Second, we benchmark current ontology reason-
      ers, by using representative ontologies for each cluster and a compre-
      hensive set of queries. We point out applicability of specific reasoners to
      certain expressivity clusters and reasoning tasks.


1   Introduction

Semantic applications based on ontologies have become increasingly important
in recent years. Yet, scalability remains one of the major obstacles in leveraging
the full power of using ontologies for practical applications. Reasoning with OWL
ontologies has high worst-case complexity, but many large-scale applications
use only fragments of OWL that are rather shallow in logical terms
and do not require sophisticated reasoning algorithms. Surveying the landscape
of existing ontologies, we observe a broad spectrum of ontologies that differ in
terms of size, complexity and their ratio between terminological and factual as-
sertions. Keet and Rodrı́guez [9] point out that there is a high demand for either
very expressive ontology languages to represent rather complete knowledge, or
less expressive ontology languages, which are more tractable w.r.t. reasoning or
other computational tasks. This issue is often called the computational cliff. It
leads to the problem that applications often exploit only small fragments of an
ontology language while using reasoners that are optimized for a larger set of
features, including features that the fragment does not use at all. Today,
a number of different reasoners are available that are based on quite different
design decisions in addressing the tradeoff between complexity and expressive-
ness on the one hand and scalability on the other hand: Classical description
logic reasoners based on tableau algorithms are able to classify large, expressive
ontologies, as often found e.g. in the bio-medical domain, but they often provide
limited support for dealing with large numbers of instances. Database-like
reasoners that materialize inferred knowledge upfront are able to handle large amounts
of assertional facts, but are in principle limited in terms of the logic they are
able to support. Choosing an appropriate reasoner for a given application task
is far from trivial. In order to support such decisions, comparisons of reasoners
based on benchmarks are required.
    While a number of performance evaluations for OWL reasoners have already
been performed in the past, all of them so far targeted only special purpose
tasks, e.g. focussing either on classical description logic reasoning tasks [4], or
on answering conjunctive queries over large knowledge bases based on rather
inexpressive ontologies [7]. In our work we aim to go a step further and intend to
provide guidance for selecting the appropriate reasoner for a given application
scenario. In order to do so, we provide a survey of the ontology landscape,
discuss typical reasoning tasks and define a comprehensive benchmark. Based
on the benchmark results we identify which reasoners are most adequate for
which classes of ontologies and corresponding reasoning tasks.
    The paper is organized as follows: In Section 2 we discuss related work on
benchmarking OWL reasoners. Based on an overview of the ontology landscape
and relevant language fragments provided in Section 3, we define our benchmark
in Section 4. In Section 5, we give a description of the reasoners selected for our
benchmark. In Section 6 we report on the experiments performed. We conclude
with an outlook to future work in Section 7.


2   Related Work

With the availability of practical reasoners for OWL, a number of benchmarks
for evaluating and comparing OWL reasoners have been proposed. The first one
– the Lehigh University Benchmark (LUBM) – was proposed by Guo et al. [7].
LUBM concentrates on the reasoning task of answering conjunctive queries over
an OWL Lite ontology with an ABox of varying size. It was later pointed out
that – while the ontology itself is in OWL Lite – answering the proposed queries
does not require OWL Lite reasoning, but instead can be performed by realizing
the ABox, i.e. computing the most specific concept(s) that each individual is
an instance of. This indeed is performed by many system benchmarks, includ-
ing for example the evaluation of RacerPro [8]. In addition to measuring the
performance of the reasoners, LUBM provides a measure for correctness of the
reasoners, analyzing how many of the correct answers are returned (complete-
ness) and how many of the returned answers are correct (soundness). However,
we believe that such measures are not helpful in selecting a reasoner for a given
task. What is missing instead is a detailed analysis of which fragments of the OWL
ontology language are actually needed for a given task and supported (correctly)
by which reasoner.
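
The two correctness measures mentioned above can be made concrete with a small sketch. It assumes that the reference answer set for a query is known; the function names and the toy answer sets are illustrative and are not taken from LUBM.

```python
def completeness(returned, expected):
    """Fraction of the correct answers that the reasoner returned."""
    expected = set(expected)
    if not expected:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(returned) & expected) / len(expected)


def soundness(returned, expected):
    """Fraction of the returned answers that are correct."""
    returned = set(returned)
    if not returned:
        return 1.0  # nothing returned, so nothing returned is wrong
    return len(returned & set(expected)) / len(returned)


# Hypothetical answer sets for a single query.
expected = {"a", "b", "c", "d"}
returned = {"a", "b", "x"}

print(completeness(returned, expected))  # 2 of 4 correct answers found
print(soundness(returned, expected))     # 2 of 3 returned answers correct
```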
    [4] presents a system for comparing DL reasoners that allows users (a) to test
and compare OWL reasoners using an extensible library of real-life ontologies;
(b) to check the correctness of the reasoners by comparing the computed class
hierarchy; (c) to compare the performance of the reasoners when performing this
task. Again, this benchmarking system is only targeted to classical DL reasoning
tasks, disregarding many other practical applications of OWL.
    [11] provides a comparison of reasoning techniques with a focus on querying
large DL ABoxes. The results show that, on knowledge bases with large ABoxes
but simple TBoxes, the KAON2 algorithms for reducing a DL knowledge base to
a disjunctive datalog program show good performance; in contrast, on knowledge
bases with large and complex TBoxes, existing techniques still perform better.
    In [15] the authors pointed out some deficiencies with existing benchmarks
and formulated requirements they would like to see met by new benchmarks.
These requirements were driven by two major use cases: (1) Frequent ABox
changes (situation classification) and (2) rare ABox changes (social networks).
While in our work we rather concentrate on the dimensions of the classes of
ontologies and reasoning tasks, we take most of the defined requirements into
account (cf. Section 4.4).
    We base our benchmark on an analysis of the ontology landscape, similar to
the one conducted by Wang et al. [14], where our more recent analysis is based
on the Watson corpus1 with a larger number of ontologies.


3   Overview of the Web Ontology Landscape
Ontologies on the web are becoming increasingly numerous and differ signifi-
cantly in their expressivity, as well as in the size of their TBoxes and ABoxes.
In order to identify a representative picture of the ontology landscape, we an-
alyzed 3303 ontologies with particular respect to their expressivity. Ontologies
were drawn from the Watson corpus and expressivity determined using statis-
tics provided by the SWOOP editor2 and the KAON2 OWL tools3 . We preferred
SWOOP to determine expressivity, since it provides a more fine-grained break-
down of DL expressivity with respect to tractable fragments than e.g. Pellet
does.
    We observed that, of the 6224 OWL ontologies recorded by
Watson, only 3303 could be loaded by both SWOOP and the KAON2
OWL tools. This is mainly due to syntactical errors, a problem which has
already been identified by d’Aquin et al. [3].
    In this analysis, we identified four main expressivity fragments, namely the
RDFS fragment of description logics4 , OWL DLP, OWL Lite, and OWL DL.
Here, the fragments OWL DL and OWL Lite comprise ontologies that fall
into description logics of at most SHOIN(D) for OWL DL and SHIF(D)
for OWL Lite, respectively. In addition to these, we included the tractable fragment
OWL DLP [6], since it retains a number of interesting features of OWL Lite while
keeping the complexity low to a certain degree. Theoretical investigations proved
the combined complexity for standard reasoning tasks in the DLP fragment to
be ExpTime, while the data complexity for both standard reasoning tasks and
conjunctive query answering remains PTime complete [5]. Figure 1 illustrates
the inclusion of the respective fragments according to their expressivity.

[Figure: inclusion diagram of the language fragments, with OWL DL at the top,
OWL DLP and OWL Lite in the middle, and RDFS(DL) at the bottom.]

          Fig. 1. Inclusion of language fragments of different expressivity.

    The OWL Working Group5 is currently also looking at other tractable
fragments [5], which we do not address in this work. The reason is that most of
these fragments are still under heavy discussion and not widely used up to now.
Furthermore, no mature reasoners are available that specifically deal with these
fragments, whereas reasoners such as OWLIM can process DLP ontologies.
For instance, by investigating ontologies from Watson that fall into the DL-Lite
fragment, we noticed that ontologies of this kind are mostly toy examples and
hence not of significant importance for this benchmark. We also dropped the EL
fragment, where SWOOP found only 3 ontologies.

1 http://watson.kmi.open.ac.uk/WatsonWUI/
2 http://www.mindswap.org/2004/SWOOP/
3 http://owltools.ontoware.org/
4 We will call this fragment RDFS(DL) subsequently.
5 http://www.w3.org/2007/OWL/wiki/OWL_Working_Group

    The most lightweight complexity fragment we identified was RDFS(DL),
which is, in terms of expressivity, sufficient for representing many taxonomy-
style ontologies. Due to its simplicity it seems to be popular for many use cases
of ontologies in current web based applications.
    In DLP we go clearly beyond RDFS(DL) by allowing more specified proper-
ties, and a restricted form of intersection, universal and existential quantification.
The DLP fragment was one of the smallest fragments we analyzed. This may be
due to the more cautious ontology design that has to be followed, when it comes
to the use of specific features that are not supported in DLP. The fact that DLP
has only very recently been identified as a tractable fragment also explains the
smaller number of ontologies found: most DLP ontologies in our corpus appear
to fall into the DLP fragment only incidentally, and few are likely to have been
explicitly designed as such.
    The OWL Lite fragment is determined by a stronger axiomatization of classes,
as the frequent use of restrictions in nearly all of those ontologies demonstrates.
    In the OWL DL fragment we can find the full range of available OWL DL
features, including nominals. This fragment contains the largest ontologies on
average; however, OWL DL ontologies are still less frequent than OWL Lite
ontologies on the web.
    Figure 2 illustrates the ontology landscape according to the different frag-
ments, in terms of number of ontologies and average number of classes in each
fragment. As one can see, the RDFS(DL) fragment is the largest fragment, where
ontologies make use of only the basic, taxonomic features, such as classes and
basic properties. However, the average number of classes in RDFS(DL) ontologies
is relatively small, mainly because of the large number of extremely small
meta-data files present on the web, such as FOAF files or RDF exports
of semantic annotations of certain websites or wiki pages. SWOOP classifies a
large percentage (73.45 %) of RDFS(DL) ontologies as OWL Full,
which indicates a high syntactical error rate in these small meta-data files.

[Figure: bar chart over the fragments RDFS(DL), DLP, OWL Lite, and OWL DL.
Left axis: No. of Ontologies (0 to 3500). Right axis: Avg. No. of Classes (0 to 60).
Series: Number of Ontologies, Average Number of Classes.]

Fig. 2. Total number of ontologies and average number of classes by language frag-
ments.


4   Benchmark Definition
4.1    Reasoning Tasks
Many reasoning tasks for OWL correspond to standard description logic reasoning
tasks, i.e. tasks that allow new conclusions to be drawn about the knowledge
base or its consistency to be checked. Theoretically it is possible to reduce all reasoning
tasks to the task of checking KB consistency. However in practice this is not
necessarily the fastest way of reasoning and various optimizations are taken for
different tasks. We therefore analyze reasoning tasks separately.

TBox reasoning tasks. Reasoning tasks typically considered for TBoxes are the
following:
 – Satisfiability checks whether a class C can have instances according to the
   current ontology.
 – Subsumption checks whether a class D subsumes a class C according to the
   current ontology. Property subsumption is defined analogously.
As a representative reasoning task we consider classification of the ontology, i.e.
computing the complete subsumption hierarchy of the ontology.
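
As a rough illustration of what classification produces, the following sketch computes the subsumption hierarchy for the simplest possible case: a TBox that contains only subclass axioms between named classes. This is an assumption made for brevity. Real OWL classification must also handle complex class expressions and therefore needs a full reasoner. The class names are hypothetical.

```python
from collections import defaultdict


def classify(subclass_axioms):
    """Map every named class to the set of all its superclasses.

    subclass_axioms is an iterable of (sub, sup) pairs, read as "sub is a
    subclass of sup". Only told axioms between named classes are handled.
    """
    direct = defaultdict(set)
    classes = set()
    for sub, sup in subclass_axioms:
        direct[sub].add(sup)
        classes.update((sub, sup))

    hierarchy = {}
    for cls in classes:
        seen = set()
        stack = list(direct[cls])
        while stack:  # depth-first walk up the told hierarchy
            sup = stack.pop()
            if sup not in seen:
                seen.add(sup)
                stack.extend(direct[sup])
        hierarchy[cls] = seen
    return hierarchy


def subsumes(hierarchy, d, c):
    """Subsumption check: does class d subsume class c?"""
    return c == d or d in hierarchy.get(c, set())


# Hypothetical TBox with three subclass axioms.
tbox = [("PhDStudent", "Student"), ("Student", "Person"), ("Professor", "Person")]
hierarchy = classify(tbox)
print(sorted(hierarchy["PhDStudent"]))
print(subsumes(hierarchy, "Person", "PhDStudent"))
```

Once the hierarchy has been computed, individual subsumption checks become set lookups, which is one reason classification is a convenient representative for TBox reasoning.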

ABox reasoning tasks. ABox reasoning tasks usually come into play at runtime of
the ontology. Reasoning tasks typically considered for ABoxes are the following:
 – Consistency checks whether the ABox is consistent with respect to the TBox.
 – Instance checking checks whether an assertion is entailed by the ABox.
 – Retrieval retrieves all individuals that instantiate a class C; dually,
   we can find all named classes C that an individual a belongs to.
 – Property fillers retrieves, given a property R and an individual i, all
   individuals x which are related with i via R. Similarly we can retrieve the set of all
   named properties R between two individuals i and j, ask whether the pair
   (i, j) is a filler of P, or ask for all pairs (i, j) that are fillers of P.
 – Conjunctive Queries are a popular query formalism capable of expressing
   the class of selection/projection/join/renaming relational queries.
As the ABox reasoning task we focus on answering conjunctive queries, as the vast
majority of query languages used in practice fall into this fragment and
conjunctive queries have been found useful in diverse practical applications.
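
To make the notion of a conjunctive query concrete, the sketch below evaluates a query as a join over explicitly asserted ABox facts. It performs no reasoning at all, so it only returns answers that are stated directly. The tuple encoding of atoms, the `?` prefix for variables, and the example individuals are assumptions made for illustration.

```python
def match(atom, fact, binding):
    """Extend binding so that atom matches fact, or return None."""
    if len(atom) != len(fact) or atom[0] != fact[0]:
        return None
    extended = dict(binding)
    for term, value in zip(atom[1:], fact[1:]):
        if term.startswith("?"):  # variable: bind it or check the binding
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:       # constant: must match exactly
            return None
    return extended


def answer(head_vars, atoms, abox):
    """Return all bindings of head_vars that satisfy every atom."""
    bindings = [{}]
    for atom in atoms:            # join the atoms one after another
        next_bindings = []
        for binding in bindings:
            for fact in abox:
                extended = match(atom, fact, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return {tuple(b[v] for v in head_vars) for b in bindings}


# Hypothetical ABox; the query has the shape Q(x) = Person(x), memberOf(x, c).
abox = {
    ("Person", "alice"),
    ("Person", "bob"),
    ("memberOf", "alice", "University0.edu"),
    ("memberOf", "bob", "University1.edu"),
}
query = [("Person", "?x"), ("memberOf", "?x", "University0.edu")]
print(answer(["?x"], query, abox))
```

A reasoner differs from this sketch in that it must also return answers that are only entailed by the TBox, for example individuals asserted as `Student` when `Student` is a subclass of `Person`.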

4.2    Performance Measures
Our primary performance measure is response time, i.e. the time that is needed
to solve the given reasoning task. This means that we ignore the utilization of
system resources, which could be another interesting measure for performance.
    In our benchmarks we separate load time and query time to fairly compare
the performance of the reasoners w.r.t. the different test ontologies:
 – Load Time (P): Includes the time for the preparation needed before
   querying, e.g. loading ontologies and checking ABox consistency.
 – Response Time (Q): Starts with executing the query and ends when all
   the query results have been stored in a local variable. Usually, the query
   time thus covers executing the query but not the time for iterating over
   the results.
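
The separation of the two measures can be sketched as a small timing harness. The `load` and `query` callables below are hypothetical stand-ins for a reasoner's native interface and are not taken from any of the benchmarked systems.

```python
import time


def measure(load, query):
    """Return (load_time, response_time, results) in seconds."""
    start = time.perf_counter()
    kb = load()              # preparation, e.g. parsing and consistency check
    load_time = time.perf_counter() - start

    start = time.perf_counter()
    results = query(kb)      # results are stored, but not iterated here
    response_time = time.perf_counter() - start
    return load_time, response_time, results


# Hypothetical stand-ins for a reasoner.
def load_toy_kb():
    return {("PhDStudent", "anna"), ("PhDStudent", "ben"), ("Professor", "carl")}


def query_phd_students(kb):
    return sorted(ind for cls, ind in kb if cls == "PhDStudent")


p, q, results = measure(load_toy_kb, query_phd_students)
print(f"P = {p:.6f} s, Q = {q:.6f} s, {len(results)} answers")
```

If a reasoner returns a lazy iterator, most of the work may happen only during iteration, which is one reason the point at which timing stops has to be defined explicitly.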

4.3    Datasets and Queries
For the language fragments identified in Section 3, we chose a representative
ontology for each fragment. The ontologies were chosen for two reasons. Firstly,
they are popular, well established ontologies, which have been used in previous
benchmarks. Secondly, they represent the cluster of ontologies as identified in
section 3 in terms of size and ontological features used.
    For each ontology, we used different datasets with increasing ABox size. Apart
from one ontology (LUBM), which comes with its own ABox generator, the
datasets are generated by duplicating the original ABox axioms several
times and renaming the individuals in each copy. More details about these data
sets can be found in Table 1.
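
The duplication step can be sketched as follows. The renaming scheme (appending a copy suffix to every individual) and the tuple encoding of assertions are assumptions; the paper does not specify how individuals were renamed.

```python
def scale_abox(abox, copies):
    """Return the original assertions plus `copies` renamed duplicates.

    Each assertion is a tuple (predicate, individual, ...). Every duplicate
    renames all individuals consistently, so copies do not share individuals.
    """
    scaled = list(abox)
    for k in range(1, copies + 1):
        for predicate, *individuals in abox:
            renamed = [f"{ind}_copy{k}" for ind in individuals]  # assumed scheme
            scaled.append((predicate, *renamed))
    return scaled


# Hypothetical base ABox with one class and one property assertion.
base = [("PhDStudent", "anna"), ("worksOn", "anna", "topic1")]
scaled = scale_abox(base, copies=2)
print(len(scaled))
for assertion in scaled:
    print(assertion)
```

With this scheme, a dataset with k duplicates has (k + 1) times as many assertions as the base dataset. The class assertion counts in Table 1 are consistent with that pattern: vicodi 4 lists 84710 = 5 × 16942 class assertions.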
    For each ontology we used test queries that were either adopted from previous
benchmarks, or explicitly defined for this benchmark. In particular we focused
on conjunctive queries, as these are crucial in terms of complexity and response
time.

VICODI. As a representative of the RDFS(DL) fragment, we used the VICODI
ontology6, and the following two ABox queries adopted from Motik and Sat-
tler [11]:
                             Qv1 (x) ≡ Individual (x)                         (1)

         Qv2 (x, y, z) ≡ Military-Person(x), hasRole(y, x), related (x, z)    (2)
6 http://www.vicodi.org
                        Table 1. Statistics of test ontologies.

Ontology  Class  Prop.  SubCl.  Equi.  SubPr.  Domain  Range  Functional   C(a)  R(a,b)  Axioms total
vicodi 0    194     10     193      0       9      10     10           0  16942   36711         53876
vicodi 1                                                                  33884   73422        107529
vicodi 2                                                                  50826  110133        161182
vicodi 3                                                                  67768  146844        214835
vicodi 4                                                                  84710  183555        268488
swrc 0       55     41     115      0       0       0      1           0   4124   13712         27227
swrc 1                                                                     8248   27424         54328
swrc 2                                                                    12372   41136         81429
swrc 3                                                                    16496   54848        108530
swrc 4                                                                    20620   68560        135631
swrc 5                                                                    24744   82272        162732
swrc 6                                                                    28868   95984        189833
swrc 7                                                                    32992  109696        216934
swrc 8                                                                    37116  123408        244035
swrc 9                                                                    41240  137120        271136
swrc 10                                                                   45364  150832        298237
lubm 1       43     25      36      6       5      25     18           0  18128   49336        100637
lubm 2                                                                    40508  113463        230155
lubm 3                                                                    58897  166682        337221
lubm 4                                                                    83200  236514        477878
wine 0      141     13     126     61       5       6      9           6    247     246           719
wine 1                                                                      741     738          1721
wine 2                                                                     1235    1230          2723
wine 3                                                                     1729    1722          3725
wine 4                                                                     2223    2214          4727
wine 5                                                                     2717    2706          5729
wine 6                                                                     5187    5166         10739
wine 7                                                                    10127   10086         20759
wine 8                                                                    20007   19926         40799
wine 9                                                                    39767   39606         80879
wine 10                                                                   79287   78966        161039

The TBox statistics (Class through Functional) are identical for all datasets
of an ontology and are therefore listed once per ontology.


SWRC. As a representative for the DLP fragment, we used the Semantic Web
for Research Communities (SWRC) ontology7. The ontology clearly lies
above the RDFS(DL) fragment (in terms of expressiveness), since it contains
universal quantification. However, this occurs only in the superclass description
of class expressions, which keeps the ontology in the DLP fragment [6]. The
following queries have been used for our benchmark:

                              Qs1 (x) ≡ PhDStudent(x)                                 (3)

      Qs2 (x, y) ≡ ResearchTopic(x), isWorkedOnBy (x, “id2042instance”),              (4)
                   dealtWithIn(x, y)

LUBM. The Lehigh University Benchmark (LUBM)8 [7] was explicitly designed
for OWL benchmarks. It models a scenario of the university domain and comes
with its own ABox generator and a set of queries. Due to existential restrictions
on the right side of class expressions, the LUBM ontology is in OWL Lite and
just beyond the DLP fragment. We used the following three queries out of the
LUBM queries:

       Ql1 (x, y1 , y2 , y3 ) ≡ Professor (x), worksFor (x, “University0.edu”),       (5)
                           mastersDegreeFrom(x, y1 ), teacherOf (x, y3 ),
                           undergraduateDegreeFrom(x, y2 )

             Ql2 (x) ≡ Person(x), memberOf (x, “University0.edu”)             (6)

           Ql3 (x, z) ≡ Student(x), Department(y), memberOf (x, y),           (7)
                        subOrganizationOf (y, “University0.edu”)

7 http://ontoware.org/projects/swrc/
8 http://swat.cse.lehigh.edu/projects/lubm/index.htm

Wine. The Wine ontology9 is a prominent example of an OWL DL ontology.
Since some reasoners are not able to handle nominals, we used the same datasets
for the Wine ontology, as in previous benchmarks (cf. [11]), where nominals have
been removed. We defined the following three ABox queries:

                    Qw1 (x) ≡ SemillonOrSauvignonBlanc(x)                     (8)

          Qw2 (x) ≡ DessertWine(x), locatedIn(x, “GermanyRegion”)             (9)

            Qw3 (x) ≡ hasFlavor (x, “Strong”), hasSugar (x, “Dry”),          (10)
                      locatedIn(x, “NewZealandRegion”)

4.4    Discussion
We built our benchmark on several requirements, motivated by the work of
Weithöner et al. [15]. Firstly, we distinguish between load time and response
time. This distinction is necessary to demonstrate the strengths and weaknesses of
reasoners that follow different paradigms, in particular materialization of
inferred ABox assertions in the setup stage (e.g. Sesame, OWLIM). Our benchmark
demonstrates how these approaches work on fairly simple ontologies (w.r.t.
TBox complexity), as well as more complex ontologies, which these reasoners
are unable to load at all. Secondly, we evaluate these measurements w.r.t.
differently scaled ABoxes but a constant TBox. In doing so, we adopted methods
that have been used in previous benchmarks, which scale up the ABox by
duplicating existing ABox assertions. While we evaluated reasoning tasks on four
different expressivity (complexity) classes regarding TBox complexity, we did
not put any effort in increasing TBox complexity for given ontologies. We rather
focused on representative ontologies of given complexity for different classes of
ontologies. We used only native interfaces of the different reasoners, and did not
consider higher abstraction layers such as DIG. The influence of different inter-
faces is negligible anyway for querying large ontologies [15]. We did not evaluate
cache influence or ABox changes on consecutive query requests, which goes
beyond the scope of this paper. We also did not consider different serializations
of ontologies, as we focus on well-established ontologies that do not occur in
different serializations.
9 http://www.schemaweb.info/schema/SchemaDetails.aspx?id=62
                           Table 2. Overview of Reasoners

                   Fragment   RDFS(DL)   DLP    OWL Lite   OWL DL
                   Example    VICODI     SWRC   LUBM       Wine
                   Sesame     ×
                   OWLIM      ×          ×
                   KAON2      ×          ×      ×          ×(a)
                   HermiT     ×          ×      ×          ×
                   RacerPro   ×          ×      ×          ×(a)
                   Pellet     ×          ×      ×          ×
                   (a) Except for nominals


5      Overview of Reasoners
In this section we provide a short overview of the reasoners we used for our
evaluations. Roughly, the reasoners can be grouped according to the employed
reasoning techniques into three groups: In the first class of traditional DL rea-
soners (e.g. RacerPro, Pellet), tableau based algorithms are used to implement
the inference calculus. A second group reuses techniques from deductive
databases: an OWL ontology is transformed into a disjunctive datalog
program, and a disjunctive datalog engine is used for reasoning, as
implemented in KAON2. A final class of reasoners, including
Sesame and OWLIM, uses a standard rule engine to reason with OWL. Often
the consequences are materialized when the ontology is loaded. However, this
procedure is in principle limited to less expressive language fragments.
    Table 2 shows an overview of the reasoners along with the language fragments
they support.

5.1     Sesame
Sesame10 is an open source repository for storing and querying RDF and RDFS
information. OWL ontologies are simply treated on the level of RDF graphs.
Sesame enables the connection to DBMS (currently MySQL, PostgreSQL and
Oracle) through the SAIL (Storage and Inference Layer) module, and also offers
a very efficient direct-to-disk SAIL called Native SAIL, which we used for our
experiments. Sesame provides RDFS inferencing and allows querying through
SeRQL, RQL, RDQL and SPARQL. Via the SAIL it is also possible to extend
the inferencing capabilities of the system.

10 http://openrdf.org

5.2     OWLIM
OWLIM is a semantic repository and reasoner, packaged as a SAIL for the Sesame
RDF database. OWLIM uses the TRREE engine to perform RDFS and OWL
DLP reasoning. It performs forward-chaining of entailment rules on top of RDF
graphs and employs a reasoning strategy which can be described as total
materialization. OWLIM offers configurable reasoning support and performance. In
the "standard" version of OWLIM (referred to as SwiftOWLIM), reasoning and
query evaluation are performed in-memory, while a reliable persistence strategy
assures data preservation, consistency and integrity.
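
Total materialization by forward chaining can be illustrated with a minimal sketch that applies two standard RDFS entailment rules (subclass transitivity and type propagation along subclass links) until no new triples are derived. This is not the TRREE engine or its rule set; the triple encoding and the example data are assumptions for illustration.

```python
def materialize(triples):
    """Forward-chain two RDFS rules to a fixed point and return the closure."""
    closure = set(triples)
    while True:
        derived = set()
        subclass = [(s, o) for s, p, o in closure if p == "rdfs:subClassOf"]
        for sub, sup in subclass:
            for s, p, o in closure:
                # Type propagation: x type sub, sub subClassOf sup => x type sup
                if p == "rdf:type" and o == sub:
                    derived.add((s, "rdf:type", sup))
                # Transitivity: sub subClassOf sup, sup subClassOf o
                #               => sub subClassOf o
                if p == "rdfs:subClassOf" and s == sup:
                    derived.add((sub, "rdfs:subClassOf", o))
        if derived <= closure:  # nothing new: fixed point reached
            return closure
        closure |= derived


# Hypothetical input graph.
graph = {
    ("PhDStudent", "rdfs:subClassOf", "Student"),
    ("Student", "rdfs:subClassOf", "Person"),
    ("anna", "rdf:type", "PhDStudent"),
}
closure = materialize(graph)
print(("anna", "rdf:type", "Person") in closure)
print(len(closure) - len(graph), "triples inferred at load time")
```

After materialization, a query such as "all instances of Person" is a plain lookup in the closure. This shifts cost from query time to load time, which is why the benchmark reports the two separately.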


5.3     KAON2

KAON2 is a free (for non-commercial usage) Java reasoner for SHIQ extended
with the DL-safe fragment of SWRL. Contrary to most currently available
DL reasoners, it does not implement the tableau calculus. Rather, reasoning in
KAON2 is implemented by novel algorithms which reduce a SHIQ(D) knowledge
base to a disjunctive datalog program. These novel algorithms allow applying
well-known deductive database techniques, such as magic sets or join-order
optimizations, to DL reasoning.


5.4     HermiT

HermiT is a freely available theorem prover for description logics. The reasoner
currently fully handles the DL SHIQ. Support for SHOIQ is currently being
developed. The main supported inference is the computation of the subsumption
hierarchy. HermiT can also compute the partial order of classes occurring
in an ontology. HermiT implements a novel hypertableau reasoning algorithm.
The main aspect of this algorithm is that it is much less non-deterministic than
the existing tableau algorithms. A description of the reasoning technique using
hypertableaux for SHIQ can be found in [12].


5.5     RacerPro

The RacerPro system [8] is an optimized tableau reasoner for SHIQ(D). For
concrete domains, it supports integers and real numbers, as well as various poly-
nomial equations over those, and strings with equality checks. It can handle
several TBoxes and several ABoxes and treats individuals under the unique
name assumption. Besides basic reasoning tasks, such as satisfiability and
subsumption, it offers ABox querying based on the nRQL optimizations. It is
implemented in the Common Lisp programming language. Recently, Racer has
been turned into the commercial (free trials and research licenses available)
RacerPro11 system, which we used for our experiments.


5.6     Pellet

Pellet12 [13] is a free open-source Java-based reasoner for SROIQ with simple
data types (i.e. for OWL 1.1). It implements a tableau-based decision procedure
for general TBoxes (subsumption, satisfiability, and classification) and ABoxes
(retrieval, conjunctive query answering). Pellet employs many of the same
optimizations for standard DL reasoning as other state-of-the-art DL reasoners.
It directly supports entailment checks and optimised ABox querying through its
interface.

11 http://www.RacerPro-systems.com/
12 http://www.mindswap.org/2003/pellet/index.shtml

                   Table 3. Performance results of classification.

                      KB            vicodi(a)    swrc      lubm     wine
                      Sesame(P)       0.769     0.635     0.467    1.180
                      Sesame(Q)       0.099       -         -        -
                      OWLIM(P)        0.580    12.990     0.684    0.964
                      OWLIM(Q)        0.071     0.079     0.093(b)   -
                      KAON2(P)        0.387     0.383     0.349    0.553
                      KAON2(Q)        2.746     1.137     1.010    5.141
                      HermiT(P)       0.889     0.776     0.708    1.090
                      HermiT(Q)       0.180     0.046     0.046    0.465
                      RacerPro(P)     0.110     0.092     0.072    0.168
                      RacerPro(Q)     0.080     0.056     0.356    0.607
                      Pellet(P)       0.563     0.518     0.404    0.835
                      Pellet(Q)       1.145     0.346     0.253    9.252
(a) The fact that reasoners take longer to classify VICODI than SWRC despite the higher
    expressivity of SWRC is due to the larger number of classes in VICODI (cf. Table 1).
(b) Even though OWLIM is not able to process OWL Lite ontologies in general, it is
    able to process the particular set of features LUBM uses.


6   Experiments

In this section we report on the benchmarking experiments performed. A tech-
nical report describing the full results of the experiments is available at [1].
    Our tests were performed on a Linux 2.6.16.1 system. Sun Java™ 1.5.0
Update 6 was used for Java-based tools and the maximum heap space was set to
800 MB. For each reasoning task, the time-out period was set to 5 minutes.
    For each reasoning task, a new instance of the reasoner is created and the
test ontology is loaded. No reasoner-specific optimization methods are
called; the default settings are used.


Classification. We consider classification as a representative task for TBox
reasoning. Table 3 compares the results for classification among the pure TBoxes
of the four ontologies selected for our tests, where ’P’ indicates load time and
’Q’ means classification time.
    Regarding the classification experiment, RacerPro clearly outperforms all
other systems in terms of load time, while HermiT performs best in terms of
classification time for all test ontologies except VICODI. It should be noted
that, despite its best performance in actual classification, HermiT is about
one order of magnitude slower in loading the ontologies than RacerPro.
     The distinction between load time and classification time becomes important
if applications frequently operate on preloaded ontologies, where load time can
be neglected. In this case, HermiT should be the reasoner of choice, as described
in the previous paragraph.
     For those TBox reasoning tasks where the ontology has to be re-loaded for
each re-classification (e.g. loose coupling between ontology development tools
and reasoners), the distinction between load time and classification time
becomes less important. If we disregard this distinction in the classification
experiment and consider only the total time, RacerPro performs best in all
expressivity fragments. Another observation is that lightweight reasoning and
storage systems such as Sesame and OWLIM do not bring any advantage in the
expressivity fragments they are tailored to. Indeed, they are still outperformed
by RacerPro and perform only slightly better than HermiT and Pellet.
     In summary, tableau-based systems generally outperform the
resolution-based KAON2 for TBox reasoning tasks. If load time is of minor
importance, HermiT with its novel hypertableau method performs best in
classifying ontologies, apart from the lightweight VICODI, where OWLIM still
performs better. If caching of preloaded ontologies is not possible and the
ontology has to be loaded for every (re-)classification, RacerPro clearly
performs best w.r.t. total execution time.
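
The three rankings discussed above (load time, classification time, and total time) can be recomputed directly from Table 3. The following Python sketch copies the table values verbatim; the helper names `score` and `best` are illustrative only, and systems without a classification time ("-") are skipped for the Q and total rankings.

```python
# Table 3 values as (load time P, classification time Q); None marks "-".
TABLE3 = {
    "Sesame":   {"vicodi": (0.769, 0.099), "swrc": (0.635, None),
                 "lubm": (0.467, None),    "wine": (1.180, None)},
    "OWLIM":    {"vicodi": (0.580, 0.071), "swrc": (12.990, 0.079),
                 "lubm": (0.684, 0.093),   "wine": (0.964, None)},
    "KAON2":    {"vicodi": (0.387, 2.746), "swrc": (0.383, 1.137),
                 "lubm": (0.349, 1.010),   "wine": (0.553, 5.141)},
    "HermiT":   {"vicodi": (0.889, 0.180), "swrc": (0.776, 0.046),
                 "lubm": (0.708, 0.046),   "wine": (1.090, 0.465)},
    "RacerPro": {"vicodi": (0.110, 0.080), "swrc": (0.092, 0.056),
                 "lubm": (0.072, 0.356),   "wine": (0.168, 0.607)},
    "Pellet":   {"vicodi": (0.563, 1.145), "swrc": (0.518, 0.346),
                 "lubm": (0.404, 0.253),   "wine": (0.835, 9.252)},
}


def score(p, q, metric):
    """Return the value of metric ('P', 'Q' or 'total'), or None if missing."""
    if metric == "P":
        return p
    if q is None:
        return None
    return q if metric == "Q" else p + q


def best(ontology, metric):
    """Return the system with the smallest value of metric for ontology."""
    candidates = {}
    for system, rows in TABLE3.items():
        value = score(*rows[ontology], metric)
        if value is not None:
            candidates[system] = value
    return min(candidates, key=candidates.get)


for ontology in ("vicodi", "swrc", "lubm", "wine"):
    print(ontology, {m: best(ontology, m) for m in ("P", "Q", "total")})
```

For every ontology the smallest load time and the smallest total time belong to RacerPro, while the smallest classification time belongs to OWLIM for VICODI and to HermiT for the other three ontologies.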


Conjunctive Queries. Figure 3 illustrates the results of the conjunctive query
experiments (see footnote 13). For RDFS(DL) ontologies, there is a clear
trade-off between loading and query time: Sesame is the slowest system to load
but the fastest system to respond. For the VICODI datasets, the average query
time of Sesame ranges from 0.18 to 0.33 seconds.
    OWLIM responds slightly slower than Sesame for RDFS(DL) ontologies, with
query times varying from 0.27 to 0.58 seconds, but it loads much faster than
Sesame. Overall, OWLIM thus performs better than Sesame. For the OWL DLP
ontology SWRC, OWLIM is the fastest reasoner in terms of query time, which
varies from 0.06 to 0.12 seconds.
    KAON2 is the best system w.r.t. the overall performance of loading and
responding, and shows favorable scalability. Although Sesame and OWLIM display
good performance for the datasets they support, KAON2 is only slightly slower
in responding but much faster to load. Pellet also responds faster than KAON2
in most cases for expressive ontologies (except for RDFS(DL) ontologies).
However, KAON2 is much faster in loading and more scalable than Pellet, which
produces time-outs for Wine ontologies larger than wine_5. RacerPro responds
slightly faster for small Wine ontologies but is significantly less scalable
than KAON2, resulting in time-outs for the largest ontologies, such as vicodi_4
and the Wine datasets from wine_5 onwards. The performance of Pellet lags
behind, and it produces time-outs as soon as the ABox reaches medium size.
13
     At the time of this evaluation, conjunctive query answering was not yet implemented
     for HermiT. For that reason we used HermiT only in the classification experiment.

[Figure 3: three rows of paired bar charts, "Load Time" (left) and "Conjunctive
Query Time" (right); x-axis: System, y-axis: Response time (s).
Row 1: KAON2, OWLIM, Pellet, RacerPro, Sesame on vicodi_0 to vicodi_4.
Row 2: KAON2, OWLIM, Pellet, RacerPro on swrc_0 to swrc_10.
Row 3: KAON2, Pellet, RacerPro on lubm_1 to lubm_4 and wine_0 to wine_9.]

Fig. 3. The performance of conjunctive queries for all datasets. The figures show the
average load and query time of all queries with a particular reasoner on a particular
dataset. “O” on top of a bar indicates time-out.




    For ABox reasoning tasks, in particular conjunctive query answering, the
choice of the reasoner depends on the expressivity of the ontology. Lightweight
ontologies (in our case VICODI and SWRC, for RDFS(DL) and OWL DLP respectively)
can be materialized at the loading stage by Sesame and OWLIM, but these systems
are unable to do so for ontologies of higher expressivity. In particular, OWLIM
performed very well while still being able to process OWL DLP, and hence should
be the choice for ABox reasoning with lightweight ontologies. For the more
expressive language fragments, OWL Lite and OWL DL, KAON2 as a resolution-based
reasoner outperforms the tableau-based systems Pellet and RacerPro. In fact,
KAON2 is the only system that is able to answer queries for all LUBM and Wine
ontologies within the given time limit of 5 minutes.
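
To illustrate what materialization at the loading stage means, the following Python sketch forward-chains two RDFS-style rules over a small set of triples until a fixpoint is reached, so that a later query becomes a plain lookup. It is a simplified illustration under stated assumptions, not the implementation used by Sesame or OWLIM; the function names and the example data (`Professor`, `Person`, `Agent`, `alice`) are hypothetical.

```python
def materialize(triples):
    """Forward-chain two RDFS-style rules until no new triple is derived.

    Rule 1: (A subClassOf B), (B subClassOf C) -> (A subClassOf C)
    Rule 2: (x type A),       (A subClassOf B) -> (x type B)
    """
    closure = set(triples)
    changed = True
    while changed:
        changed = False
        subclass_pairs = [(s, o) for (s, p, o) in closure if p == "subClassOf"]
        derived = set()
        for (a, b) in subclass_pairs:
            for (s, p, o) in closure:
                if p == "subClassOf" and s == b:
                    derived.add((a, "subClassOf", o))   # Rule 1
                if p == "type" and o == a:
                    derived.add((s, "type", b))         # Rule 2
        if not derived <= closure:
            closure |= derived
            changed = True
    return closure


def instances_of(closure, cls):
    """Answer a class-membership query by lookup in the materialized set."""
    return {s for (s, p, o) in closure if p == "type" and o == cls}


# Hypothetical example data.
loaded = materialize({
    ("Professor", "subClassOf", "Person"),
    ("Person", "subClassOf", "Agent"),
    ("alice", "type", "Professor"),
})
print(instances_of(loaded, "Agent"))
```

All inference cost is paid once in `materialize`, which corresponds to a longer load time, while `instances_of` needs no reasoning at query time. This mirrors the trade-off between loading and query time observed above for the lightweight fragments.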
   Figure 4 depicts a differentiated picture along two main dimensions: language
complexity and ABox size. RacerPro is the system of choice in settings with
high complexity and small ABoxes, OWLIM can generally be recommended for
low-complexity settings, and KAON2 is the best alternative for all other cases.

[Figure 4: diagram with x-axis "A-Box Size" (Small to Large) and y-axis
"Language Complexity" (Low to High). The dataset series VICODI, SWRC, LUBM and
WINE are stacked from low to high complexity; the regions are labelled OWLIM
(low complexity), RACER (high complexity, small ABox) and KAON2 (remaining
area).]

                  Fig. 4. Recommendation for reasoner selection
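
The recommendation of Fig. 4 can be written down as a small decision function. The Python sketch below is only an illustration: the function name `recommend_reasoner` is hypothetical, the binary grouping of fragments into "low" and "high" complexity follows the distinction between lightweight and more expressive fragments made above, and no numeric threshold between small and large ABoxes is implied, since none is given here.

```python
# Assumed binary grouping of the four expressivity fragments.
FRAGMENT_COMPLEXITY = {
    "RDFS(DL)": "low",
    "OWL DLP": "low",
    "OWL Lite": "high",
    "OWL DL": "high",
}


def recommend_reasoner(fragment, abox_size):
    """Return the recommended system for a fragment and an ABox size.

    fragment:  one of the keys of FRAGMENT_COMPLEXITY
    abox_size: 'small' or 'large' (the caller decides what counts as small)
    """
    if abox_size not in ("small", "large"):
        raise ValueError("abox_size must be 'small' or 'large'")
    complexity = FRAGMENT_COMPLEXITY[fragment]
    if complexity == "low":
        return "OWLIM"      # low-complexity settings
    if abox_size == "small":
        return "RacerPro"   # high complexity, small ABox
    return "KAON2"          # all other cases


print(recommend_reasoner("OWL DL", "large"))
```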


7   Conclusion

In today’s landscape of ontologies, we observe a wide spectrum of ontologies that
differ in terms of language expressivity as well as in complexity in terms of
the size of their TBox and ABox. A number of rather different reasoning techniques
are implemented in state-of-the-art OWL reasoners. In our benchmarks we have
shown that it is important to understand the strengths and weaknesses of the
different approaches in order to select an adequate reasoner for a given reasoning
task. It does not come as a surprise that there is no clear “winner” that performs
well for all types of ontologies and reasoning tasks.
    As general conclusions, we can summarize our results as follows: (1) reasoners
that employ a simple rule engine scale very well for large ABoxes, but are in
principle limited to lightweight language fragments, (2) classical tableau
reasoners scale well for complex TBox reasoning tasks, but are limited with
respect to their support for large ABoxes, and (3) the reasoning techniques based
on reduction to disjunctive datalog as implemented in KAON2 scale well for
large ABoxes, while at the same time supporting a rich language fragment.
If nominals are important for a given scenario, Pellet is the only reasoner in this
benchmark that has adequate support.
    An important current research topic is the investigation of tractable
fragments of the OWL language [5, 10] and the development of reasoners
specialized for these fragments. During our analysis of the ontology landscape
we found many of the ontologies on the web to be erroneous, which hampers a
satisfactory expressivity analysis. As future work, we will extend our analysis
to take new fragments into account and provide means to reveal more detailed
statistics of the ontology landscape on the web.

References

 1. J. Bock, P. Haase, Q. Ji, and R. Volz. Benchmarking OWL Reasoners - Techni-
    cal Report. Technical report, University of Karlsruhe, 2007. http://www.aifb.
    uni-karlsruhe.de/WBS/pha/publications/owlbenchmark.pdf.
 2. I. F. Cruz, S. Decker, D. Allemang, C. Preist, D. Schwabe, P. Mika, M. Uschold, and
    L. Aroyo, editors. The Semantic Web - ISWC 2006, 5th International Semantic
    Web Conference, ISWC 2006, Athens, GA, USA, November 5-9, 2006, Proceedings,
    volume 4273 of Lecture Notes in Computer Science. Springer, 2006.
 3. M. d’Aquin, C. Baldassarre, L. Gridinoc, S. Angeletou, M. Sabou, and E. Motta.
    Characterizing knowledge on the semantic web with Watson. In Proceedings of the
    5th International EON Workshop, International Semantic Web Conference (ISWC
    2007), Busan, Korea, 2007.
 4. T. Gardiner, D. Tsarkov, and I. Horrocks. Framework for an automated comparison
    of description logic reasoners. In Cruz et al. [2], pages 654–667.
 5. B. Cuenca Grau. Tractable Fragments of the OWL 1.1 Web Ontology Lan-
    guage, February 2006. http://owl-workshop.man.ac.uk/Tractable.html, ac-
    cessed 22/06/2007.
 6. B. Grosof, I. Horrocks, R. Volz, and S. Decker. Description logic programs: Com-
    bining logic programs with description logics. In Proc. of WWW 2003, Budapest,
    Hungary, May 2003, pages 48–57. ACM, 2003.
 7. Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems.
    J. Web Sem., 3(2-3):158–182, 2005.
 8. V. Haarslev, R. Möller, and M. Wessel. Querying the semantic web with Racer
    + nRQL. In Proceedings of the KI-2004 International Workshop on Applications of
    Description Logics (ADL’04), Ulm, Germany, September 24, 2004.
 9. C. M. Keet and M. Rodrı́guez. Comprehensiveness versus scalability: guidelines
    for choosing an appropriate knowledge representation language for bio-ontologies.
    Technical Report KRDB07-5, Faculty of Computer Science, Free University of
    Bozen-Bolzano, 2007.
10. M. Krötzsch, S. Rudolph, and P. Hitzler. Complexity boundaries for Horn
    description logics. In Proceedings of the 22nd AAAI Conference on Artificial
    Intelligence, pages 452–457, Vancouver, British Columbia, Canada, 2007. AAAI Press.
11. B. Motik and U. Sattler. A comparison of reasoning techniques for querying large
    description logic aboxes. In M. Hermann and A. Voronkov, editors, LPAR, volume
    4246 of Lecture Notes in Computer Science, pages 227–241. Springer, 2006.
12. B. Motik, R. Shearer, and I. Horrocks. Optimized reasoning in description logics
    using hypertableaux. In F. Pfenning, editor, CADE, volume 4603 of Lecture Notes
    in Computer Science, pages 67–83. Springer, 2007.
13. E. Sirin, B. Parsia, B. Cuenca Grau, A. Kalyanpur, and Y. Katz. Pellet: A Prac-
    tical OWL-DL Reasoner. Technical report, University of Maryland Institute for
    Advanced Computer Studies (UMIACS), 2005. http://mindswap.org/papers/
    PelletDemo.pdf.
14. T. D. Wang, B. Parsia, and J. A. Hendler. A survey of the web ontology landscape.
    In Cruz et al. [2], pages 682–694.
15. T. Weithöner, T. Liebig, M. Luther, S. Böhm, F. von Henke, and O. Noppens.
    Real-world Reasoning with OWL. In E. Franconi, M. Kifer, and W. May, editors,
    Proceedings of the European Semantic Web Conference, ESWC2007, volume 4519
    of Lecture Notes in Computer Science. Springer-Verlag, July 2007.