<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Manchester OWL Repository: System Description</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicolas Matentzoglu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Tang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bijan Parsia</string-name>
          <email>bparsia@cs.manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Uli Sattler</string-name>
          <email>sattler@cs.manchester.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Manchester</institution>
          ,
          <addr-line>Oxford Road, Manchester, M13 9PL</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Tool development for and empirical experimentation in OWL ontology research require a wide variety of suitable ontologies as input for testing and evaluation purposes and detailed characterisations of real ontologies. Findings of surveys and results of benchmarking activities may be biased, even heavily, towards manually assembled sets of "somehow suitable" ontologies. We are building the Manchester OWL Repository, a resource for creating and sharing ontology datasets, to push the quality frontier of empirical ontology research and provide access to a great variety of well-curated ontologies.</p>
      </abstract>
      <kwd-group>
        <kwd>Repository</kwd>
        <kwd>Ontologies</kwd>
        <kwd>Empirical</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Empirical work with ontologies comes in a wide variety of forms, for example
surveys of the modular structure of ontologies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], surveys of modelling patterns to
inform design decisions of engineering environments [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and benchmarking
activities for reasoning services such as Description Logic (DL) classification [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Since
it is generally difficult to obtain representative datasets, for both technical
reasons (lack of suitable collections) and conceptual reasons (lack of agreement
on what they should be representative of), it is common practice to manually
select a somewhat arbitrary set of ontologies that usually supports the given
case. On top of that, few authors ever publish the datasets they used, often for
practical reasons (e.g. size, effort), which often makes it impossible to
reproduce experimental results. The currently best option for ontology-related research is the
BioPortal repository [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which provides a web-based interface for browsing
ontologies in the biomedical domain and a REST web service to programmatically
obtain copies of all (public) versions of a wide range of biomedical ontologies.
There are, however, certain problems with this option. First, the repository is
limited to biomedical ontologies, which makes BioPortal unsuitable for surveys
that require access to ontologies of different domains. The second problem is the
technical barrier of accessing the web service: it requires a good amount of work
to download all interesting ontologies, for example because a range of ontologies
is published in compressed form, or because of the logistical hurdle of recreating new
snapshots over and over again. The third problem is that there is
no shared understanding of what it means to "use BioPortal". Different authors
have different inclusion and exclusion criteria; for example, they only take the
ontologies that are easily parseable after download, or the ones that were accessible
at a particular point in time. The Manchester OWL Repository aims to bridge
that gap by providing a framework for conveniently retrieving some standard
datasets and allowing users to create, and share, their own.
      </p>
      <fig id="fig-1">
        <label>Figure 1</label>
        <caption>
          <p>[Figure not reproduced. Labels recoverable from the diagram: Sources, BioPortal, Web crawl, Oxford Ontology Library, Curation, MOWLCorp, OWL/XML, ORIGINAL, Manchester OWL Repository, Pool, Interface, Web Frontend, Restful Webservice, API.]</p>
        </caption>
      </fig>
    </sec>
    <sec id="sec-2">
      <title>Overall architecture</title>
      <p>The Manchester OWL Repository can be divided into four layers (see Figure 1).
The first layer represents the data gathering. Through web crawls, web scrapes,
API calls, and user contributions, ontologies are collected and stored in their
respective collections. The second layer represents the three main data sources of
the repository, each providing ontologies in their original and curated (OWL/XML)
form. The third layer, the pool, represents a virtual layer in which access to the
ontologies is unified, providing some means for de-duplication, since the
corpora may intersect. Lastly, the interface layer provides access to
the repository through a REST service and a web-based front end.</p>
    </sec>
    <sec id="sec-3">
      <title>Data Gathering</title>
      <p>The main component of the data gathering layer is a web crawl that fills
MOWLCorp, which makes up the bulk of the repository's data. The crawl is based on
crawler4j, a Java-based framework for custom web crawling, and on daily calls to
the Google Custom Search API. An ongoing BioPortal downloader creates a
snapshot of BioPortal once per month using the BioPortal web services, whilst
retaining copies of all versions available so far. The third (minor) component of
the repository is a web scrape of the Oxford Ontology Library (OOL), a
hand-curated set of ontologies which features some particularly difficult ontologies
that are thus interesting to reasoner developers. Ontologies are downloaded in their
raw form and passed to the curation pipeline.</p>
    </sec>
    <sec id="sec-4">
      <title>Data curation</title>
      <p>
        Ontology candidates from all three sources undergo a mild form of repair
(undeclared entity injection, rewriting of non-absolute IRIs) and are exported into
OWL/XML, with their imports closure merged into a single ontology, while
retaining information about the source ontology of each axiom through respective
annotations. Metrics and files for both the original and the curated versions of
the ontologies are retained and form part of the repository. The data curation
differs slightly between the three data sources, especially with respect to
filtering. Apart from the criterion of OWL API [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] parseability, BioPortal and the
OOL are left unfiltered because they are already deemed curated. This means
that some ontologies in the corpus may not contain any logical axioms at all.
In MOWLCorp, on the other hand, we filter out ontologies that 1) have an
empty TBox (root ontology) or 2) are byte-identical duplicates of another ontology
after serialisation into OWL/XML. The reason for the first step is our focus on ontologies
(which excludes pure collections of RDF instance data) and the fact that the
imports closure is part of the repository, i.e., imported ontologies are downloaded and evaluated
independently of the root ontology.
      </p>
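      <p>The two MOWLCorp filtering steps can be illustrated with a minimal Python sketch. It is not the repository's actual pipeline: the Candidate record, its field names and the use of SHA-256 digests as a stand-in for a byte-by-byte comparison are all assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one curated ontology candidate."""
    name: str
    tbox_axiom_count: int  # number of TBox axioms in the root ontology
    owlxml: bytes          # OWL/XML serialisation of the curated ontology


def filter_candidates(candidates):
    """Drop candidates with an empty TBox and byte-identical duplicates."""
    seen_digests = set()
    kept = []
    for candidate in candidates:
        if candidate.tbox_axiom_count == 0:
            continue  # step 1: empty TBox, e.g. pure RDF instance data
        # Assumption: equal SHA-256 digests are treated as equal bytes.
        digest = hashlib.sha256(candidate.owlxml).hexdigest()
        if digest in seen_digests:
            continue  # step 2: same serialisation as an earlier candidate
        seen_digests.add(digest)
        kept.append(candidate)
    return kept


if __name__ == "__main__":
    corpus = [
        Candidate("a.owl", 12, b"<Ontology>A</Ontology>"),
        Candidate("a-mirror.owl", 12, b"<Ontology>A</Ontology>"),
        Candidate("data.rdf", 0, b"<Ontology>D</Ontology>"),
        Candidate("b.owl", 3, b"<Ontology>B</Ontology>"),
    ]
    print([candidate.name for candidate in filter_candidates(corpus)])
```
]]></preformat>
      <p>In this sketch the first copy of a duplicated serialisation is kept and later copies are dropped; which copy the repository retains is not stated in the text.</p>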
    </sec>
    <sec id="sec-5">
      <title>Accessing the repository</title>
      <p>There are currently three different means to access the repository: 1) a web
frontend<sup>1</sup> provides access to preconstructed datasets and their descriptions, 2)
an experimental dataset creator allows users to create custom datasets based
on a wide range of metrics and 3) an experimental REST-based web service
allows users to create a dataset programmatically. Since 2) is based on 3), we
now describe the query language that allows users to create their own datasets
and access the web service.</p>
      <p>The query language allows the user to construct statements that represent
filter criteria for ontologies based on some essential metrics such as axiom and
entity counts, or profile membership. It roughly conforms to the following
grammar:</p>
      <preformat>q      = comp { ("&amp;&amp;" | "||") comp }
comp   = metric ("&gt;=" | "&lt;=" | "=") n
metric = "axiom count" | "class count" | ...</preformat>
      <p>where "metric" should be a valid metadata element. The query language parser
was built with the open-source parser generator Yacc and the lexical analyser generator Lex.</p>
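      <p>To make the grammar concrete, the following Python sketch parses and evaluates such queries. It is an illustration only and not the Yacc/Lex parser used by the repository. Several points are assumptions: n is taken to be a non-negative integer, metric names are taken literally as printed in the grammar (the real identifiers may differ), and the two connectives are applied strictly left to right because the grammar defines no precedence between them.</p>
      <preformat><![CDATA[
```python
import re

# One comparison: metric name, operator, non-negative integer (assumed form of n).
_COMP = re.compile(r"^\s*([A-Za-z][A-Za-z _]*?)\s*(>=|<=|=)\s*(\d+)\s*$")

_OPERATORS = {
    ">=": lambda value, number: value >= number,
    "<=": lambda value, number: value <= number,
    "=": lambda value, number: value == number,
}


def parse_query(text):
    """Parse `comp { ("&&" | "||") comp }` into (comparisons, connectives)."""
    # The capturing group keeps the connectives at the odd positions.
    parts = re.split(r"(&&|\|\|)", text)
    comparisons = []
    for chunk in parts[0::2]:
        match = _COMP.match(chunk)
        if match is None:
            raise ValueError(f"not a valid comparison: {chunk!r}")
        metric, operator, number = match.groups()
        comparisons.append((metric, operator, int(number)))
    return comparisons, parts[1::2]


def matches(query, metrics):
    """Evaluate a query string against one ontology's metric values."""
    comparisons, connectives = parse_query(query)

    def holds(comparison):
        metric, operator, number = comparison
        # An unknown metric raises KeyError: it is not a valid metadata element.
        return _OPERATORS[operator](metrics[metric], number)

    result = holds(comparisons[0])
    for connective, comparison in zip(connectives, comparisons[1:]):
        value = holds(comparison)
        result = (result and value) if connective == "&&" else (result or value)
    return result


if __name__ == "__main__":
    ontology_metrics = {"axiom count": 250, "class count": 40}
    print(matches("axiom count >= 100 && class count <= 50", ontology_metrics))
```
]]></preformat>
      <p>A dataset would then be the set of ontologies whose metadata satisfies the query; the hypothetical matches function stands in for that membership test.</p>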
      <p>The repository web services are built using the PHP framework Laravel.
Laravel is a framework with support for RESTful services, so that
users can access the services using a REST client, or simply using a web
browser and web-based tools such as cURL. For now, we have implemented three
services: query, checkStatus and download. The query service accepts a query
string that complies with the query language and returns an id string. Afterwards,
users can use the id string to check the status of their query and to download
the final dataset using the checkStatus and download services.</p>
      <p><sup>1</sup> http://mowlrepo.cs.manchester.ac.uk/</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>service</th><th>url</th><th>method</th><th>param</th><th>return</th></tr>
          </thead>
          <tbody>
            <tr><td>query</td><td>/api/</td><td>POST</td><td>query</td><td>JSON array with fields: status, count, size, message, progress</td></tr>
            <tr><td>check status</td><td>/api/checkStatus/</td><td>GET</td><td>id</td><td>JSON array with fields: status, progress</td></tr>
            <tr><td>download</td><td>/api/resource</td><td>GET</td><td>id</td><td>file stream</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The usage of the services is listed in Table 1; note that the URLs should be
appended to mowlrepo.cs.manchester.ac.uk, which has been omitted.</p>
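      <p>As an illustration of how a client might address the three services, the following Python sketch builds the corresponding HTTP requests with the standard library. Only the paths, methods and parameter names (query, id) come from the text. The http scheme, the form encoding of the POST body and the passing of id as a URL query parameter are assumptions, and all function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import json
import urllib.parse
import urllib.request

# Host named in the text; the "http" scheme is an assumption.
BASE_URL = "http://mowlrepo.cs.manchester.ac.uk"


def build_query_request(query, base_url=BASE_URL):
    """POST a query string to /api/ (form encoding is an assumption)."""
    body = urllib.parse.urlencode({"query": query}).encode("ascii")
    return urllib.request.Request(base_url + "/api/", data=body, method="POST")


def build_status_request(dataset_id, base_url=BASE_URL):
    """GET /api/checkStatus/ for an id (query-parameter passing is assumed)."""
    params = urllib.parse.urlencode({"id": dataset_id})
    return urllib.request.Request(
        f"{base_url}/api/checkStatus/?{params}", method="GET"
    )


def build_download_request(dataset_id, base_url=BASE_URL):
    """GET /api/resource for an id (query-parameter passing is assumed)."""
    params = urllib.parse.urlencode({"id": dataset_id})
    return urllib.request.Request(f"{base_url}/api/resource?{params}", method="GET")


def send_json(request, timeout=30):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the requests; nothing is sent over the network.
    for request in (
        build_query_request("axiom count >= 100"),
        build_status_request("example-id"),
        build_download_request("example-id"),
    ):
        print(request.get_method(), request.full_url)
```
]]></preformat>
      <p>A client would send the query request first, poll the status request until the dataset is ready, and then fetch the download request as a file stream rather than as JSON.</p>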
    </sec>
    <sec id="sec-6">
      <title>Next steps</title>
      <p>We have presented the Manchester OWL Repository and a range of prototype
interfaces to access pre-constructed datasets and create custom ones. We believe
that the repository will help push the quality frontier of empirical
ontology-related research by providing access to shareable, well-curated datasets. We
are currently working on the REST services, the dataset creator and improved
dataset descriptions. In the near future, we are aiming to 1) integrate the
repository with Zenodo, a service that allows hosting large datasets that are citable
via DOIs, 2) extend our metadata to capture even more ontology properties (in
particular consistency and coherence) and 3) improve the curation pipeline by
implementing extended yet safe fixes for OWL DL profile violations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Del Vescovo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Klinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Sattler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schneider</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsarkov</surname>
          </string-name>
          .
          <article-title>Empirical study of logic-based modules: Cheap is cheerful</article-title>
          .
          In
          <source>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          , volume
          <volume>8218</volume>
          LNCS, pages
          <fpage>84</fpage>
          –
          <lpage>100</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Goncalves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bail</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jimenez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Matentzoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Parsia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Glimm</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kazakov</surname>
          </string-name>
          .
          <article-title>OWL Reasoner Evaluation (ORE) Workshop 2013 Results: Short Report</article-title>
          . In
          <source>ORE</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>18</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Horridge</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bechhofer</surname>
          </string-name>
          .
          <article-title>The OWL API: A Java API for OWL ontologies</article-title>
          .
          <source>Semantic Web</source>
          ,
          <volume>2</volume>
          :
          <fpage>11</fpage>
          –
          <lpage>21</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Horridge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tudorache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vendetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nyulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Musen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Noy</surname>
          </string-name>
          .
          <article-title>Simplified OWL Ontology Editing for the Web: Is WebProtégé Enough?</article-title>
          In
          <source>International Semantic Web Conference (1)</source>
          , pages
          <fpage>200</fpage>
          –
          <lpage>215</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. H.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Whetzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dorf</surname>
          </string-name>
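          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Griffith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jonquet</surname>
          </string-name>
          ,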
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Storey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Chute</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Musen</surname>
          </string-name>
          .
          <article-title>BioPortal: Ontologies and integrated data resources at the click of a mouse</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>37</volume>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>