<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging logical rules for efficacious representation of large orthology datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tarcisio M. de Farias</string-name>
          <email>tarcisio.mendesdefarias@unil.ch</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hirokazu Chiba</string-name>
          <email>chiba@dbcls.rois.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jesualdo T. Fernandez-Breis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Database Center for Life Science (DBCLS)</institution>
          ,
          <addr-line>ROIS</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Departamento de Informatica y Sistemas, Universidad de Murcia, IMIB-Arrixaca</institution>
          ,
          <addr-line>30100 Murcia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Computational Biology, University of Lausanne</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>SIB Swiss Institute of Bioinformatics</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the semantic web applied to life sciences, ontologies provide a basis to define concepts and to describe data in biological databases, thereby facilitating data interoperability across multiple resources. In the context of evolutionary genetics, the best corresponding genes across different species (e.g. the insulin genes in the pig and the human) are called "orthologs". Dozens of bioinformatic resources identify and describe such orthologs. To represent the orthology content, an OWL-based orthology ontology (ORTH) was recently proposed. However, the ORTH ontology lacks a basis for inferring pairwise relations between genes, as well as more specific and accurate definitions of class restrictions, property domains and property ranges, which hampers wider adoption by orthology resources. To address this issue, we present in this paper our common efforts to define a release candidate of a second version of the ORTH ontology. Using this ontology, we propose a logical rule-based approach to infer information that is not explicitly defined in the primary data. As a benefit of our approach, for example, we can avoid the materialization of several billion triples to represent the "is orthologous to" relation when considering the Orthologous Matrix (OMA) dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>ORTH ontology</kwd>
        <kwd>OWL</kwd>
        <kwd>Horn-like rule</kwd>
        <kwd>ortholog</kwd>
        <kwd>paralog</kwd>
        <kwd>orthology database</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The genes shared among different species are evidence of evolution from a
common ancestor. For example, we share approximately 90% of our genes with mice.
These related genes are called orthologs: genes in different species
that evolved from a common ancestral gene by a speciation event. Orthologs
are normally thought to retain the same function. This functional conservation
of related genes across species explains the success of model organism-based
research, which enables knowledge about human biology and medicine to be gained
from other species, such as mice, fruit flies or yeast. In this context, knowing
the orthologs between, say, mice and humans allows biological processes to be
studied in mice and the resulting knowledge to be transferred to humans.</p>
      <p>
        In the field of life sciences, ontologies have been identified as a key
technology for achieving data interoperability across multiple resources and
for annotating data, the Gene Ontology [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] being the most popular and successful
one. The interest in ontologies in biomedicine can be illustrated by the fact that
repositories such as BioPortal [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] contain, at the time of writing, more than six
hundred biomedical ontologies, terminologies and controlled vocabularies. The
orthology research community has shown growing interest in ontologies in
recent years, since the creation of the Quest for Orthologs (QfO) consortium<sup>1</sup>.
QfO pursues the standardization and interoperability of orthology resources and
methods, including the development of common standards and formats for the
representation of orthology information and knowledge. The 2013 QfO meeting
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] identified the potential benefits of semantic web technologies for the
interoperability of orthology information. Since then, QfO researchers have developed the
first version of the Orthology Ontology (ORTH)<sup>2</sup>, which served to demonstrate
the feasibility of creating semantically interoperable orthology resources [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>Experience with the ORTH has revealed some limitations for the activities
needed by the QfO community. More concretely, new orthology-related concepts
need to be formalized in the ontology, and some aspects of the current
representation need to be improved to permit a more powerful, reasoning-based
exploitation of orthology data. In this paper, we justify why such changes
are necessary in the ORTH and present our common efforts to define a
release candidate (RC) of a second version of the ORTH. In addition, we examine and
compare the performance of two ways of executing queries that require
inferencing. The main goal of this evaluation is to find the most appropriate approach
to infer pairwise orthology relations without materializing them, since
materialization would significantly increase the number of triples to store in the already
large orthology datasets. Therefore, the main contribution of this paper is how to
efficaciously store orthology information using the Resource Description
Framework (RDF). The extension and re-engineering of the ORTH ontology are only
a step towards this goal.</p>
      <p>The structure of the rest of the paper is described next. In Section 2, we will
provide some background on orthology and on inferencing using semantic web
content. Section 3 will present the changes made to the ORTH. The method
for inferring pairwise orthology relations will be explained in Section 4. The
experimental results of comparing the execution of inference-based queries to
obtain pairwise orthology relations will be shown and discussed in Section 5.
Finally, some conclusions will be put forward in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>1 https://questfororthologs.org/</title>
    </sec>
    <sec id="sec-3">
      <title>2 http://purl.org/net/orth</title>
      <sec id="sec-3-1">
        <title>Background</title>
        <sec id="sec-3-1-1">
          <title>Basic concepts about orthology</title>
          <p>Definition 1. Homologs are genes related to each other by descent from a
common ancestor. Homology is the more general term: it covers both the relationship between
genes separated by a speciation event (see Definition 2 for Ortholog) and the
relationship between genes separated by a genetic duplication event (see Definition
3 for Paralog).</p>
          <p>
            Definition 2. Orthologs are genes in different species that evolved from a
common ancestral gene by speciation. Orthologs are normally thought to retain
the same function in the course of evolution [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ].
          </p>
          <p>
            Definition 3. Paralogs are genes related by duplication. Unlike
orthologs (see Definition 2), paralogs are more likely to evolve new functions.
Paralogs can be classified as inparalogs and outparalogs [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ].
          </p>
          <p>
            Definition 4. Xenologs are homologous genes that are neither orthologs nor
paralogs according to the above definitions, but appear to be orthologous in genome
comparisons [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. They occur due to horizontal gene transfer [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ].
          </p>
          <p>
            Definition 5. Hierarchical Orthologous Groups (HOGs) are defined as sets of
genes that have descended from a single common ancestor within a taxonomic
range of interest [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. In computer science terms, the data structure used to
represent a HOG is a tree.
          </p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Inference-based exploitation of orthology content</title>
          <p>
            There is little experience in the optimization of queries on large RDF orthology
datasets. In [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], SPARQL queries were used for obtaining pairwise orthology
relations, and those queries required the use of some properties defined in the
ORTH in a transitive way. Such inferencing capability has to be provided by
a triple store supporting SPARQL 1.1. In previous works, such queries were
executed over a series of graphs available in the same triple store. In [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], the
authors use the ORTH to compose conjunctive queries over various knowledge bases
(KBs) such as the Microbial Genome Database (MBGD)<sup>3</sup> and the Universal Protein
Resource (UniProt)<sup>4</sup>, although they did not investigate possible optimizations
for executing inference-based SPARQL queries.
          </p>
          <p>
            SPARQL query rewriting is a query optimization approach whose popularity
has increased significantly in recent years, and it is especially useful when
inferencing is an important component in the execution of the queries [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. SPARQL
query rewriting is based on changing the graph pattern included in the query
while ensuring that the semantics of the query is preserved, by using mappings
between the query elements and the ontology. The rewriting can affect the subject,
predicate or object of the triples of the query patterns.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 http://mbgd.genome.ad.jp</title>
    </sec>
    <sec id="sec-5">
      <title>4 http://www.uniprot.org</title>
      <p>Languages such as SWRL<sup>5</sup>, RIF<sup>6</sup> or SPIN<sup>7</sup> also allow inferencing
to be used in data exploitation. SWRL and RIF permit the definition and the execution of
Horn-like rules, and SPIN is built on top of SPARQL. However, to the best
of our knowledge, neither SPARQL query rewriting nor the other mentioned languages
have been explored as solutions for the exploitation of large orthology datasets.</p>
      <sec id="sec-5-1">
        <title>Constructing the updated ontology</title>
        <p>One of the main advantages of a DL-based ontology for knowledge
representation is the ability to leverage Horn-like rules to infer information that is not explicitly
described in the primary data. In the context of recent genomics, leveraging
inference enables us to store a large dataset in a compact form and to retrieve
implicit information on demand (see Section 4 for further details). However, the
previously published ORTH ontology has several issues that must be addressed in order
to take advantage of the DL-based ontological representation:</p>
        <list list-type="order">
          <list-item><p>The ORTH ontology is not fully compliant with OWL 2 DL because of the ontologies it imports.</p></list-item>
          <list-item><p>There are no properties to describe pairwise relations between genes.</p></list-item>
          <list-item><p>Definitions of property domains and ranges are missing.</p></list-item>
          <list-item><p>Class restrictions need to be reviewed.</p></list-item>
          <list-item><p>Several species are missing from the imported taxonomy ontology.</p></list-item>
        </list>
        <p>In the following paragraphs, we present how we solved those issues. For the
sake of simplicity, in the rest of this paper we omit namespace prefixes
whenever this does not compromise understandability.</p>
        <p>DL compliance. The first release of the ORTH ontology<sup>8</sup> asserts that
rdfs:Resource ⊑ ⊤ (i.e. rdfs:Resource a owl:Class) and ⊤ ⊑ ∀hasSource.rdfs:Resource.
Nevertheless, in the OWL 2 DL profile, for the sake of decidability, an
entity cannot be an instance and a class at the same time. As a reminder,
rdfs:Resource is an instance of rdfs:Class, and owl:Class is a subclass of rdfs:Class.
Therefore, not all RDFS classes are legal OWL DL classes. Although this issue
is not a relevant problem in terms of data modeling, without fixing it we
cannot take advantage of the available reasoning tools. These tools are
fundamentally important to our Horn-like rule-based approach presented in Section
4. To address this first issue, we removed the axioms rdfs:Resource ⊑ ⊤ and
⊤ ⊑ ∀hasSource.rdfs:Resource.</p>
        <p>Pairwise relations. In genetics, genes that derive from a
common ancestral DNA sequence can be related as homologs, orthologs, paralogs, xenologs,
inparalogs or outparalogs. The first version of the ORTH ontology
allows pairwise relations to be obtained by means of SPARQL queries over the</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5 https://www.w3.org/Submission/SWRL/</title>
    </sec>
    <sec id="sec-7">
      <title>6 https://www.w3.org/TR/rif-overview/</title>
    </sec>
    <sec id="sec-8">
      <title>7 http://spinrdf.org/</title>
    </sec>
    <sec id="sec-9">
      <title>8 https://bioportal.bioontology.org/ontologies/ORTH</title>
      <p>semantic representation of the HOGs, but does not contain properties to
assert these relations between genes. However, being able to represent, persist
and exploit such relations is needed for some exploitation scenarios. To be
able to represent the pairwise relations, we include the axioms in Listing 3.1.</p>
      <preformat>⊤ ⊑ ∀hasHomolog.SequenceUnit
∃hasHomolog.⊤ ⊑ SequenceUnit
hasOrtholog ⊑ hasHomolog
hasParalog ⊑ hasHomolog
hasXenolog ⊑ hasHomolog</preformat>
      <p>Listing 3.1. The axioms added to describe homologous pairwise relations.</p>
      <p>Properties similar to hasHomolog, hasOrtholog and hasParalog already exist in
the Semanticscience Integrated Ontology (SIO). However, SIO does not
specify the domain and range of these properties. Moreover, SIO is a more
general-purpose ontology that has been reused in ORTH. Nonetheless, for the sake
of interoperability, we can state that the ORTH pairwise relations are
subproperties of their corresponding SIO properties, where these exist.</p>
      <p>Property and class restrictions. As an example of a property range
modification, we changed the range of the hasCluster property from GeneTreeNode
to the HomologsCluster class, because the property value must be a
cluster rather than a gene. Further details of the changes in class restrictions and property
domains and ranges in the ORTH ontology are available at
https://github.com/qfo/OrthologyOntology.</p>
      <p>Species taxonomy ontology. The NCBI organismal taxonomy ontology
used in the first version of the ORTH ontology refers to a view of the NCBITaxon
ontology<sup>9</sup>. Thus, it does not describe an exhaustive list of species. Because of
this, we replaced the NCBI_1 class with NCBITaxon_1, which is the root taxonomy
class in the NCBITaxon ontology.</p>
      <p>
        Several classes in life sciences related ontologies are not supposed to be
instantiated or they are singleton classes (i.e. the class is only instantiated once).
Some examples are the classes of the following life science ontologies: Gene
Ontology [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], UBERON ontology [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the SIO ontology and also the NCBITaxon ontology.
Therefore, when importing the NCBITaxon ontology along with the new
version of the ORTH ontology, one instance must be created for each species class
in order to assign the 'in taxon' property to a SequenceUnit instance, which is done
using the punning<sup>10</sup> feature of OWL 2. This class instantiation is necessary to
be DL compliant because an NCBITaxon class cannot be directly assigned to
the 'in taxon' property. As a reminder, only an instance can be a value of an
object property. Further analysis of the drawbacks of defining a large
Terminological Box (TBox) with singleton classes, instead of having a smaller TBox with
a relevant Assertional Box (ABox), is beyond the scope of this paper. For
information, the NCBITaxon ontology contains about 1,600,000 classes.
      </p>
      <p>To build the new RC ORTH ontology, we made 27 modifications to the
previous ORTH ontology version, which include adding and removing properties,</p>
    </sec>
    <sec id="sec-10">
      <title>9 http://www.obofoundry.org/ontology/ncbitaxon.html 10 https://www.w3.org/TR/owl2-new-features/#F12:_Punning</title>
      <p>property domains, property ranges, classes and class restrictions. A full description of
these modifications is available at https://github.com/qfo/OrthologyOntology.
The RC ORTH ontology is available for download at
http://purl.org/net/orth_rc.</p>
      <sec id="sec-10-1">
        <title>Inferring pairwise relations from hierarchical structures</title>
        <p>End-users are typically interested in pairwise relationships such as "is
orthologous to". With the DL-based RC ORTH ontology
described in Section 3, we can now assert pairwise relations
between genes. However, today's orthology information providers store all pairwise
relationships, the number of which grows quadratically with the number of genes or genomes.
To address this problem, we capture the implicit information about pairwise
relationships with an inference engine. This information is implicitly structured in
HOGs (see Section 2 for further details). In doing so, the data to be stored and
retrieved scales linearly. For example, we do not need to store pairwise orthologs
between species because they can be inferred by applying the R1 Horn-like rule
shown in Listing 4.1. Thus, with our approach we can infer new information
instead of materializing it. For example, we can avoid the materialization of
6,464,814,646 triples to explicitly define orthologous relationships when
considering solely 1,048,561 out of 4,172,982 orthologous clusters in the latest
Orthologous Matrix (OMA) database (DB) release. For comparison, by using
the HOGs, we need only 16,911,449 triples to implicitly define the pairwise
orthologs in OMA.</p>
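        <p>The quadratic versus linear growth described above can be illustrated with a small Python sketch. It assumes, purely for illustration, a single flat cluster of n sequences, each from a different species: materialization stores one hasOrtholog triple per ordered pair, whereas the hierarchical representation stores one membership triple per sequence. The function names are hypothetical, and the printed numbers are not the OMA figures quoted above.</p>
        <preformat>
```python
def pairwise_triples(n):
    """Triples needed to materialize every ordered pair among n sequences."""
    return n * (n - 1)


def membership_triples(n):
    """Triples needed to attach each of n sequences to its cluster."""
    return n


# Compare both storage strategies for a few illustrative cluster sizes.
for n in (10, 100, 1000):
    print(n, pairwise_triples(n), membership_triples(n))
```
        </preformat>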
        <preformat>R1: OrthologsCluster(cluster) ∧ hasHomologousMember(cluster, node1) ∧ hasHomologousMember(cluster, node2)
    ∧ 'has part'(node2, seq2) ∧ 'has part'(node1, seq1) ∧ SequenceUnit(seq1) ∧ SequenceUnit(seq2)
    ∧ (node1 ≠ node2) → hasOrtholog(seq1, seq2)
R2: ParalogsCluster(cluster) ∧ hasHomologousMember(cluster, node1) ∧ hasHomologousMember(cluster, node2)
    ∧ 'has part'(node2, seq2) ∧ 'has part'(node1, seq1) ∧ SequenceUnit(seq1) ∧ SequenceUnit(seq2)
    ∧ (node1 ≠ node2) → hasParalog(seq1, seq2)</preformat>
        <p>Listing 4.1. The Horn-like rules that infer the hasOrtholog (R1) and hasParalog (R2)
properties for a given SequenceUnit instance (e.g. a Gene instance).</p>
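        <p>As a rough illustration of how such rules derive pairs from a hierarchy, the following Python sketch applies analogues of R1 and R2 to a toy structure. The identifiers (c1, c2, g1, g2, g3), the dictionaries and the function names are all hypothetical, and the sequences below a node are collected through zero or more membership steps, as the property path in the SPARQL formulation does; it is a sketch under these assumptions, not the inference engine used in this work.</p>
        <preformat>
```python
from itertools import product

# Toy hierarchy (hypothetical identifiers): an orthologs cluster c1 whose
# members are a nested paralogs cluster c2 and the sequence g3.
cluster_type = {"c1": "OrthologsCluster", "c2": "ParalogsCluster"}
members = {"c1": ["c2", "g3"], "c2": ["g1", "g2"]}  # hasHomologousMember
sequences = {"g1", "g2", "g3"}  # SequenceUnit instances


def descendants(node):
    """Sequences reachable from node via zero or more membership steps."""
    if node in sequences:
        return {node}
    found = set()
    for child in members.get(node, []):
        found |= descendants(child)
    return found


def infer(relation_type):
    """Ordered sequence pairs implied by clusters of the given type."""
    pairs = set()
    for cluster, kind in cluster_type.items():
        if kind != relation_type:
            continue
        # Every ordered pair of distinct direct members (node1 != node2).
        for n1, n2 in product(members[cluster], repeat=2):
            if n1 == n2:
                continue
            for s1, s2 in product(descendants(n1), descendants(n2)):
                pairs.add((s1, s2))
    return pairs


orthologs = infer("OrthologsCluster")  # analogue of R1
paralogs = infer("ParalogsCluster")    # analogue of R2
print(sorted(orthologs))
print(sorted(paralogs))
```
        </preformat>
        <p>On this toy input, the ortholog rule yields the four ordered pairs between g3 and each of g1 and g2, while the paralog rule yields the two ordered pairs between g1 and g2. Only the four membership triples and two type assertions are stored.</p>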
        <p>
          Listing 4.2 contains the equivalent subquery to the R1 rule in Listing 4.1
to retrieve the implicit hasOrtholog assertions. This subquery can be used with
a SPARQL query rewrite approach [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] to infer the hasOrtholog relations
between genes (or proteins). Therefore, it is an alternative to a general-purpose
inference engine. For example, triple stores that do not fully
support reasoning can use the subquery in Listing 4.2 to replace the occurrences of
hasOrtholog in the original SPARQL query. Suppose we have the
following SPARQL query: SELECT * { ?g1 :hasOrtholog ?g2. ?g1 :geneName
'APOC1'. }. By parsing this query, a SPARQL query rewrite approach
identifies the basic graph pattern (BGP) ?g1 :hasOrtholog ?g2, which is replaced with
the graph pattern between braces in Listing 4.2, with variable names adapted (e.g.
?seq_1 is replaced with ?g1). The expanded query is then executed on a SPARQL
endpoint (i.e. triple store). In Section 5, we present the performance
in terms of query execution time and retrieved results, along with a discussion
of the benefits and drawbacks of both approaches.
        </p>
        <preformat>SELECT ?seq_1 ?seq_2 {
  ?cluster a :OrthologsCluster.
  ?cluster :hasHomologousMember ?node_1.
  ?cluster :hasHomologousMember ?node_2.
  ?node_1 :hasHomologousMember* ?seq_1.
  ?node_2 :hasHomologousMember* ?seq_2.
  {?seq_1 a :Gene. ?seq_2 a :Gene.} UNION
  {?seq_1 a :Protein. ?seq_2 a :Protein.}
  FILTER (?node_1 != ?node_2)
}</preformat>
        <p>Listing 4.2. The subquery to assert the hasOrtholog property for a given SequenceUnit
instance (e.g. a Gene or Protein instance).</p>
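        <p>The rewriting step described above can be sketched in a few lines of Python. This is a minimal, text-based illustration and not the rewriting tool used in this work: it assumes that the triple pattern is written with the default prefix and two variables and is terminated by a period, and the helper names (TEMPLATE, TRIPLE, HELPER, rewrite) and the _rw suffix for fresh variables are hypothetical. A production rewriter would operate on the parsed query algebra instead of on the query string.</p>
        <preformat>
```python
import re

# Graph pattern of Listing 4.2, used as the expansion template.
TEMPLATE = """{
  ?cluster a :OrthologsCluster.
  ?cluster :hasHomologousMember ?node_1.
  ?cluster :hasHomologousMember ?node_2.
  ?node_1 :hasHomologousMember* ?seq_1.
  ?node_2 :hasHomologousMember* ?seq_2.
  {?seq_1 a :Gene. ?seq_2 a :Gene.} UNION
  {?seq_1 a :Protein. ?seq_2 a :Protein.}
  FILTER (?node_1 != ?node_2)
}"""

# Matches a triple pattern such as "?g1 :hasOrtholog ?g2."
TRIPLE = re.compile(r"(\?\w+)\s+:hasOrtholog\s+(\?\w+)\s*\.")
# Matches the variables of the template that must be renamed.
HELPER = re.compile(r"\?(cluster|node_1|node_2|seq_1|seq_2)\b")


def rewrite(query):
    """Replace each ':hasOrtholog' triple pattern with the expanded pattern."""
    counter = 0

    def expand(match):
        nonlocal counter
        counter += 1
        names = {
            "seq_1": match.group(1),  # subject variable of the original triple
            "seq_2": match.group(2),  # object variable of the original triple
            # Fresh helper variables so that several expansions do not clash.
            "cluster": f"?cluster_rw{counter}",
            "node_1": f"?node_1_rw{counter}",
            "node_2": f"?node_2_rw{counter}",
        }
        return HELPER.sub(lambda m: names[m.group(1)], TEMPLATE)

    return TRIPLE.sub(expand, query)


print(rewrite("SELECT * { ?g1 :hasOrtholog ?g2. ?g1 :geneName 'APOC1'. }"))
```
        </preformat>
        <p>Renaming the helper variables in a single regular-expression pass avoids accidental clashes when the user query already uses names such as ?seq_2 for the subject.</p>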
        <p>The R2 rule in Listing 4.1 is a Horn-like rule to infer the hasParalog property.
The equivalent SPARQL subquery for hasParalog is the same as the subquery
in Listing 4.2, except that the first triple pattern, ?cluster a
:OrthologsCluster, is replaced with ?cluster a :ParalogsCluster.</p>
        <p>Some resources actually use orthologous clusters as homologous clusters. To
handle this at the query level, we can add a condition to the R1 rule in
Listing 4.1 and to the query in Listing 4.2 so that only genes/proteins in
different species (i.e. orthologs) are considered. Nevertheless, the concepts of homolog and ortholog
should not be confused.</p>
        <p>As a consequence of our proposed Horn-like rule-based approach, queries
for retrieving orthology information also become easier to write, since the
second version of the ORTH ontology is more fine-grained: some
property values are assigned by applying Horn-like rules (e.g. Semantic Web Rule
Language rules) at query execution time.</p>
      </sec>
      <sec id="sec-10-2">
        <title>Results and Discussion</title>
        <p>To further justify the storage gain obtained by inferring pairwise relations
instead of materializing them, we inferred about 8,034,238,900 hasParalog
assertions between proteins in the OMA DB by applying the R2 rule in Listing
4.1. These inferred assertions include the symmetric inferences (i.e. if A
hasParalog B then B hasParalog A). Therefore, with the ORTH ontology based
on HOGs, we can efficaciously represent RDF-based homology relations such as
hasParalog and hasOrtholog.</p>
        <p>
          The experiment consisted of comparing the time performance of the SPARQL
query rewrite and DL-safe [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] Horn-like rule-based approaches. For this
purpose, we used the subqueries presented in Section 4. Each query was
executed thirty times for each approach. We considered solely one OMA
HOG at the LUCA taxonomic level, which contains 2,727 proteins. In this
experiment, we used the Stardog 5 triple store [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] with 6 GB of dedicated RAM.
All the tests were run on a computer with a 3.5 GHz dual-core Intel Core
i7 processor (Turbo Boost up to 4.0 GHz), 16 GB of 2133 MHz LPDDR3 memory
and a 1 TB SSD. Stardog was chosen because it supports
DL-safe Horn-like rules combined with OWL 2 constructs and reasoning at query
execution time [
          <xref ref-type="bibr" rid="ref13 ref5">5, 13</xref>
          ].
        </p>
        <p>We executed the Q1 and Q2 queries in Listing 5.1 by using a SPARQL query
rewrite approach and Stardog's DL-safe rule inference engine. The Q1 query
retrieves all hasOrtholog relations of the protein with the HUMAN29522 OMA
identifier. This protein is the cytochrome c oxidase subunit 1 encoded by the
MT-CO1 gene. The Q2 query retrieves all hasParalog relations for the same
protein. Table 1 presents the results obtained in terms of query execution
time in milliseconds (mean and standard deviation) and the number of retrieved
results for the 30 executions of the Q1 and Q2 queries.</p>
        <preformat>Q1: SELECT ?seq_1 { ?seq_1 orth:hasOrtholog oma:PROTEIN_HUMAN29522 }
Q2: SELECT ?seq_1 { ?seq_1 orth:hasParalog oma:PROTEIN_HUMAN29522 }</preformat>
        <p>Listing 5.1. Querying the orthologous (Q1) and paralogous (Q2) genes of the human MT-CO1
gene in the OMA database.</p>
        <p>From Table 1, we can conclude that the SPARQL query rewrite approach is
on average about 106.6 ms and 39.5 ms faster than the DL-safe rule-based approach
at retrieving the same number of hasOrtholog and hasParalog assertions,
respectively. As a reminder, for the results in this table, we only considered the HOG
that contains the HUMAN29522 protein, whereas there are 589,223 HOGs in the
OMA DB. Table 2 shows the results of executing the queries in Listing 5.1 taking
into account all OMA HOGs and using a timeout of 5 minutes.</p>
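        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>Performance comparison between the SPARQL query rewrite and DL-safe Horn-like rule-based approaches for the Q1 and Q2 queries in Listing 5.1.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Query</th><th>Approach</th><th>Mean time (ms)</th><th>Std deviation (σ)</th><th>#Results</th></tr>
            </thead>
            <tbody>
              <tr><td>Q1</td><td>SPARQL query rewrite</td><td>193.7</td><td>33.8</td><td>2,722</td></tr>
              <tr><td>Q1</td><td>DL-safe rule based</td><td>300.3</td><td>78.1</td><td>2,722</td></tr>
              <tr><td>Q2</td><td>SPARQL query rewrite</td><td>65.1</td><td>13.0</td><td>4</td></tr>
              <tr><td>Q2</td><td>DL-safe rule based</td><td>104.6</td><td>17.8</td><td>4</td></tr>
            </tbody>
          </table>
        </table-wrap>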
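        <p>The gaps between the two approaches can be recomputed directly from the mean times in Table 1. The short Python sketch below does so; the dictionary layout and the function name are hypothetical, and the ratio is an additional derived quantity that is not reported in the table itself.</p>
        <preformat>
```python
# Mean execution times in milliseconds, copied from Table 1.
mean_ms = {
    ("Q1", "rewrite"): 193.7,
    ("Q1", "rule"): 300.3,
    ("Q2", "rewrite"): 65.1,
    ("Q2", "rule"): 104.6,
}


def gap_and_ratio(query):
    """Absolute gap (ms) and ratio between rule-based and rewrite mean times."""
    rewrite_time = mean_ms[(query, "rewrite")]
    rule_time = mean_ms[(query, "rule")]
    return round(rule_time - rewrite_time, 1), round(rule_time / rewrite_time, 2)


for query in ("Q1", "Q2"):
    print(query, gap_and_ratio(query))
```
        </preformat>
        <p>For both queries the rule-based mean time is roughly one and a half times the rewrite mean time, although the standard deviations in Table 1 are large relative to these gaps.</p>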
        <p>Table 2 demonstrates that the DL-safe Horn-like rule-based approach is not
able to retrieve any results within 5 minutes of query execution using the
Stardog triple store. This is mainly because the Horn-like rules to infer hasParalog
and hasOrtholog relations contain a transitive property labeled "has part"
instead of the :hasHomologousMember* SPARQL property path
(https://www.w3.org/TR/sparql11-property-paths/; see the query
in Listing 4.2). The performance issues arise because Stardog first
processes the 'has part' transitive property, for which neither subject
nor object is bound. Therefore, Stardog attempts to infer all possible 'has part'
assertions over all HOGs before applying the join operations. As a reminder,
for the tests in Table 2, we consider the whole OMA DB, which contains
9,443,947 proteins without counting alternative splicing. This explains why the
DL-safe rule-based approach based on Stardog is not capable of retrieving any
result within milliseconds. However, by using the :hasHomologousMember* SPARQL
property path, Stardog calculates a better query execution plan, as shown in
Table 2. Because of this, Stardog's SPARQL processor retrieves all results in
milliseconds. This also explains why the SPARQL query rewrite approach had
better results than the DL-safe rule-based one in Table 1 when considering only
one HOG.</p>
        <p>Despite Stardog's results on processing transitive
properties reported in this section, the main benefit of the Horn-like rule-based approach described
in Section 4 is the possibility of reusing inferred concepts and properties to
define other Horn-like rules. This can be done in a modular way, similar to
a function in traditional programming languages (e.g. C). Therefore,
implicit information in an orthology database becomes explicit by defining these
logical rules. Another benefit is that we can take advantage of general-purpose
inference engines to process the Horn-like rules.</p>
      </sec>
      <sec id="sec-10-3">
        <title>Conclusion</title>
        <p>To build the RC of a second version of the ORTH ontology, we made 27
modifications to the previous ORTH version, which include adding and removing
properties, property domains, property ranges, classes and class restrictions. We also
discussed how the ORTH ontology should be instantiated to avoid, for example,
non-compliance with DL due to imported ontologies. Moreover, we described the
benefits of using a rule-based approach to infer new information from the
orthology data. In doing so, we can drastically reduce the number of stored triples,
make SPARQL queries easier to write and reuse inferred properties to
define new rules. We also discussed performance issues of a Horn-like rule-based
approach compared to a query rewrite approach. Although our
experiments with Stardog show that a SPARQL query rewrite approach is more
efficient, we cannot conclude that it is significantly better than a DL-safe Horn-like
rule-based one. This is because Stardog does not calculate the query execution
plan in the same way for transitive properties and for SPARQL property paths.</p>
        <p>One final remark concerns performing the tests in Section 5 with
alternative triple stores that support Horn-like rules combined with OWL
2 constructs and perform reasoning at query execution time. In future work,
we will consider annotating the ORTH entities by harnessing natural language
processing and keyword searching techniques.</p>
      </sec>
      <sec id="sec-10-4">
        <title>Acknowledgements</title>
        <p>This work has been financed by the Swiss National Research Programme (NFP)
75 (see http://www.nfp75.ch) - SNSF Project 167149. Part of the work was
supported by the ROIS International Networking project and conducted through
NBDC/DBCLS BioHackathon 2017 (see http://www.biohackathon.org).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Complexible</given-names>
            <surname>Inc</surname>
          </string-name>
          .
          <article-title>: Stardog 5: The manual (</article-title>
          <year>2017</year>
          ) Available online: http://docs.stardog.com/.
          <source>Last accessed on October</source>
          ,
          <year>10th 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Altenhoff</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gil</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonnet</surname>
            ,
            <given-names>G.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dessimoz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Inferring hierarchical orthologous groups from orthologous gene pairs</article-title>
          .
          <source>PLoS One</source>
          <volume>8</volume>
          (
          <issue>1</issue>
          ) (
          <year>2013</year>
          ) e53786
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ashburner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ball</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blake</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Botstein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butler</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cherry</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dolinski</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dwight</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eppig</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          , et al.:
          <article-title>Gene ontology: tool for the unification of biology</article-title>
          .
          <source>Nature Genetics</source>
          <volume>25</volume>
          (
          <issue>1</issue>
          )
          (
          <year>2000</year>
          )
          <fpage>25</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chiba</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uchiyama</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>SPANG: a SPARQL client supporting generation and reuse of queries for distributed RDF databases</article-title>
          .
          <source>BMC Bioinformatics</source>
          <volume>18</volume>
          (
          <issue>1</issue>
          )
          (
          <year>2017</year>
          )
          <fpage>93</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>de Farias</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roxin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicolle</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>SWRL rule-selection methodology for ontology interoperability</article-title>
          .
          <source>Data &amp; Knowledge Engineering</source>
          <volume>105</volume>
          (
          <year>2016</year>
          )
          <fpage>53</fpage>
          –
          <lpage>72</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fernandez-Breis</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiba</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>del Carmen Legaz-García</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uchiyama</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>The orthology ontology: development and applications</article-title>
          .
          <source>Journal of Biomedical Semantics</source>
          <volume>7</volume>
          (
          <issue>1</issue>
          )
          (
          <year>2016</year>
          )
          <fpage>34</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Koonin</surname>
            ,
            <given-names>E.V.</given-names>
          </string-name>
          :
          <article-title>Orthologs, paralogs, and evolutionary genomics</article-title>
          .
          <source>Annu. Rev. Genet</source>
          .
          <volume>39</volume>
          (
          <year>2005</year>
          )
          <fpage>309</fpage>
          –
          <lpage>338</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Makris</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gioldasis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bikakis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christodoulakis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Ontology mapping and SPARQL rewriting for querying federated RDF data sources</article-title>
          .
          <source>On the Move to Meaningful Internet Systems, OTM 2010</source>
          (
          <year>2010</year>
          )
          <fpage>1108</fpage>
          –
          <lpage>1117</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Makris</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gioldasis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bikakis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christodoulakis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>SPARQL rewriting for query mediation over mapped ontologies</article-title>
          . Technical University of Crete (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Motik</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Reasoning in description logics using resolution and deductive databases</article-title>
          .
          <source>PhD thesis</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mungall</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torniai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gkoutos</surname>
            ,
            <given-names>G.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haendel</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>Uberon, an integrative multi-species anatomy ontology</article-title>
          .
          <source>Genome Biology</source>
          <volume>13</volume>
          (
          <issue>1</issue>
          )
          (
          <year>2012</year>
          ) R5
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>N.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Whetzel</surname>
            ,
            <given-names>P.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dorf</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Griffith</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jonquet</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rubin</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Storey</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chute</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          , et al.:
          <article-title>BioPortal: ontologies and integrated data resources at the click of a mouse</article-title>
          .
          <source>Nucleic Acids Research</source>
          <volume>37</volume>
          (
          <issue>suppl 2</issue>
          ) (
          <year>2009</year>
          )
          <fpage>W170</fpage>
          –
          <lpage>W173</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pauwels</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Farias</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roxin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beetz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Roo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicolle</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A performance benchmark over semantic rule checking approaches in construction industry</article-title>
          .
          <source>Advanced Engineering Informatics</source>
          <volume>33</volume>
          (
          <year>2017</year>
          )
          <fpage>68</fpage>
          –
          <lpage>88</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Sonnhammer</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gabaldon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sousa da Silva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robinson-Rechavi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boeckmann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dessimoz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Big data and other challenges in the quest for orthologs</article-title>
          .
          <source>Bioinformatics</source>
          <volume>30</volume>
          (
          <issue>21</issue>
          ) (
          <year>2014</year>
          )
          <fpage>2993</fpage>
          –
          <lpage>2998</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Soucy</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gogarten</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          :
          <article-title>Horizontal gene transfer: building the web of life</article-title>
          .
          <source>Nature Reviews Genetics</source>
          <volume>16</volume>
          (
          <issue>8</issue>
          ) (
          <year>2015</year>
          )
          <fpage>472</fpage>
          –
          <lpage>482</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>