<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Storage and Reasoning Systems Evaluation Campaign 2010</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mikalai Yatskevich</string-name>
          <email>mikalai.yatskevich@comlab.ox.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ian Horrocks</string-name>
          <email>ian.horrocks@comlab.ox.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco Martin-Recuerda</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgos Stoilos</string-name>
          <email>giorgos.stoilos@comlab.ox.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Oxford University Computing Laboratory</institution>
          ,
          <addr-line>Wolfson Building, Parks Road, Oxford OX1 3QD</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Politécnica de Madrid</institution>
          ,
          <addr-line>Campus de Montegancedo, sn 28660 Boadilla del Monte</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>According to the ISO-IEC 9126-1 standard [5], interoperability is a software functionality sub-characteristic defined as "the capability of the software product to interact with one or more specified systems". In order to interact with other systems, a DLBS must conform to the standard input formats and must be able to perform standard inference services. In our setting, the standard input format is the OWL 2 language. We evaluate the standard inference services:
3 http://www.w3.org/TR/2009/PR-owl2-conformance-20090922/</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Evaluation Criteria</title>
      <p>- Class satisfiability;
- Ontology satisfiability;
- Classification;
- Logical entailment.</p>
      <p>The last two are defined in the OWL 2 Conformance document3, while the first two are extremely common tasks during ontology development, and are de facto standard tasks for DLBSs.</p>
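The four standard inference services can be sketched as a minimal reasoner interface. This is a hypothetical Python sketch for illustration only; the method names are ours, and real DLBSs are driven through the OWL API.

```python
from abc import ABC, abstractmethod

class DLReasoner(ABC):
    """Hypothetical interface covering the four standard inference services."""

    @abstractmethod
    def is_class_satisfiable(self, ontology, class_iri) -> bool:
        """Can the named class have at least one instance in some model?"""

    @abstractmethod
    def is_ontology_satisfiable(self, ontology) -> bool:
        """Does the ontology as a whole admit a model?"""

    @abstractmethod
    def classify(self, ontology):
        """Compute the full subsumption hierarchy of named classes."""

    @abstractmethod
    def entails(self, ontology, axiom) -> bool:
        """Does the ontology logically entail the given axiom?"""
```

A concrete evaluation harness would wrap each participating system behind this kind of interface so that all four tasks can be invoked uniformly.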
      <p>The performance criterion relates to the efficiency software characteristic from ISO-IEC 9126-1. Efficiency is defined as "the capability of the software to provide appropriate performance, relative to the amount of resources used, under stated conditions". We take a DLBS's performance as its ability to efficiently perform the standard inference services. We will not consider the scalability criterion for the Storage and Reasoning Systems Evaluation Campaign 2010 because suitable test data is not currently available. The reason is that while hand-crafted ontologies can be tailored to provide interesting tests, at least for particular systems, it is very difficult to create hand-crafted ontologies that are resistant to the various optimizations used in modern systems. Furthermore, hand-crafted ontologies are rather artificial, since their potential models are often restricted to those having a very regular structure. Synthetic DL formulas may be randomly generated [4, 7, 9], but then no correct answer is known for them in advance. There has been extensive research on random ABox generation in recent years [2, 6]; these works are tailored to query answering scalability evaluation. Real-world ontologies provide a way to assess the kind of performance that DLBSs are likely to exhibit in end-user applications, and this is by far the most common kind of evaluation found in recent literature. However, it is not a trivial task to create a good scalability test involving real-world ontologies. To the best of our knowledge, no such tests are available at the moment; the particular problems are parametrization and uniformity of the input.</p>
    </sec>
    <sec id="sec-2">
      <title>Evaluation Metrics</title>
      <p>The evaluation must provide informative data with respect to DLBS interoperability. We use the number of tests passed by a DLBS without parsing errors as a metric of a system's conformance to the relevant syntax standard. The number of inference tests passed by a DLBS is a metric of a system's ability to perform the standard inference services. An inference test is counted as passed if the system result coincides with a "gold standard". In practice, the "gold standard" is either produced by a human expert or computed. In the latter case, the results of several systems should coincide in order to minimize the influence of implementation errors. Moreover, systems used to generate the "gold standard" should be believed to be sound and complete, and should be known to produce correct results on a wide range of inputs.</p>
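The computed "gold standard" described above amounts to requiring agreement between several systems. A minimal sketch of that agreement rule, assuming each system reports an answer (or nothing) for a single test:

```python
from collections import Counter

def computed_gold_standard(results, min_agreement=2):
    """Derive a computed 'gold standard' answer for one test.

    `results` maps a system name to its answer (or None when the
    system produced no answer). A gold answer is accepted only when
    at least `min_agreement` systems coincide; otherwise no gold
    standard is produced and the test cannot be scored this way.
    Hypothetical helper; the campaign's actual tooling may differ.
    """
    counts = Counter(v for v in results.values() if v is not None)
    if not counts:
        return None
    answer, votes = counts.most_common(1)[0]
    return answer if votes >= min_agreement else None
```

With a threshold of two, a single dissenting system does not block the gold standard, which reflects the goal of minimizing the influence of isolated implementation errors.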
      <p>The evaluation must also provide informative data with respect to DLBS performance. The performance of a system is measured as the time the system needs to perform a given inference task. We also record task loading time, to assess the amount of preprocessing used in a given system. It is difficult to separate the inference time from the loading time, given that some systems perform a great deal of reasoning and caching at load time, while others only perform reasoning in response to inference tasks. Thus, we account for both times, reflecting the diversity in DLBS behavior.</p>
    </sec>
    <sec id="sec-3">
      <title>Evaluation Process</title>
      <p>We evaluate both interoperability and performance for the standard inference services. Our evaluation infrastructure requires that systems conform to the OWL API 3 [3]. The output of an evaluation is the evaluation status, which is one of the following: TRUE, FALSE, ERROR, UNKNOWN. TRUE is returned if the ontology or class in question is satisfiable, or if the entailment holds. FALSE is returned if the ontology or class is unsatisfiable, or if the entailment does not hold. ERROR indicates a parsing error. UNKNOWN is returned if a system is unable to determine an evaluation result.</p>
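The mapping from a single evaluation run onto the four statuses can be sketched as follows. The wrapper and the use of an exception type to stand in for a parsing failure are our own illustrative assumptions:

```python
def evaluation_status(run):
    """Map one evaluation run onto TRUE / FALSE / ERROR / UNKNOWN.

    `run` is a zero-argument callable that returns True or False for a
    satisfiability or entailment check, returns None when the system
    cannot decide, or raises SyntaxError (standing in here for a
    parsing failure in the underlying system).
    """
    try:
        answer = run()
    except SyntaxError:
        return "ERROR"      # the input could not be parsed
    if answer is True:
        return "TRUE"
    if answer is False:
        return "FALSE"
    return "UNKNOWN"        # the system gave no definite answer
```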
    </sec>
    <sec id="sec-4">
      <title>Testing Data</title>
      <p>Our collected data set contains most of the ontologies that are well established and widely used for testing DLBSs' inference services. More precisely, it contains:
- The ontologies from the Gardiner evaluation suite. This suite now contains over 300 ontologies of varying expressivity and size. The test suite was originally created specifically for the purpose of evaluating the performance of ontology satisfiability testing and classification by DLBSs [1]. It has since been maintained and extended by the Oxford University Knowledge Representation and Reasoning group4, and has been used in various other evaluations (e.g., [8]).
- Various versions of the GALEN ontology [10]. The GALEN ontology is a large and complex biomedical ontology which has proven to be notoriously difficult for DL systems to classify, even for modern highly optimized ones. For this reason, several "weakened" versions of GALEN have been produced by system developers in order to provide a subset of GALEN which some reasoners are able to classify.
- Various ontologies that have been created in EU funded projects, such as SEMINTEC, VICODI and AEO.</p>
      <p>We use the OWL 2 conformance document as a guideline for conformance testing data. In particular, we focus on semantic entailment and non-entailment conformance tests. 148 entailment tests and 10 non-entailment tests from the OWL 2 test cases repository5 are used for evaluating a DLBS's conformance.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We have provided a general framework for evaluating advanced reasoning systems. We also described the evaluation design of the Storage and Reasoning Systems Evaluation Campaign 2010, including criteria and metrics definitions, test workflows and API methods to be implemented by evaluation participants, as well as the ontologies to be used as testing data in the campaign.
4 http://web.comlab.ox.ac.uk/activities/knowledge/index.html
5 http://owl.semanticweb.org/page/OWL 2 Test Cases
2. Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems.</p>
      <p>Web Semantics: Science, Services and Agents on the World Wide Web, 3(2-3):158-182, 2005.
3. M. Horridge and S. Bechhofer. The OWL API: A Java API for Working with OWL 2 Ontologies. In Rinke Hoekstra and Peter F. Patel-Schneider, editors, OWLED, volume 529 of CEUR Workshop Proceedings. CEUR-WS.org, 2008.
4. I. Horrocks, P. F. Patel-Schneider, and R. Sebastiani. An analysis of empirical testing for modal decision procedures. Logic Journal of the IGPL, 8(3):293-323, 2000.
5. ISO/IEC. ISO/IEC 9126-1. Software Engineering - Product Quality - Part 1:</p>
      <p>
        Quality model. 2001.
6. L. Ma, Y. Yang, Z. Qiu, G. T. Xie, Y. Pan, and S. Liu. Towards a complete OWL ontology benchmark. In ESWC, pages 125-139, 2006.
7. F. Massacci and F. M. Donini. Design and results of TANCS-2000 non-classical (modal) systems comparison. In TABLEAUX '00: Proceedings of the International Conference on Automated Reasoning with Analytic Tableaux and Related Methods, pages 52-56, London, UK, 2000. Springer-Verlag.
8. B. Motik, R. Shearer, and I. Horrocks. Hypertableau reasoning for description logics. J. of Artificial Intelligence Research, 2009. To appear.
9. P. F. Patel-Schneider and R. Sebastiani. A new general method to generate random modal formulae for testing decision procedures. J. Artif. Intell. Res. (JAIR), 18:351-389, 2003.
10. A. L. Rector and J. Rogers. Ontological and practical issues in using a description logic to represent medical concept systems: Experience from GALEN. In Reasoning Web, pages 197-231, 2006.
11. D. Tsarkov, I. Horrocks, and P. F. Patel-Schneider. Optimizing terminological reasoning for expressive description logics. J. Autom. Reasoning, 39(3):277-316, 2007.
12. T. D. Wang, B. Parsia, and J. Hendler. A survey of the web ontology landscape.
      </p>
      <p>In Proceedings of the International Semantic Web Conference, ISWC, 2006.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>T.</given-names>
            <surname>Gardiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsarkov</surname>
          </string-name>
          .
          <article-title>Automated benchmarking of description logic reasoners</article-title>
          .
          <source>In Proc. of the 2006 Description Logic Workshop (DL</source>
          <year>2006</year>
          ), volume
          <volume>189</volume>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>