<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A large-scale gene-centric semantic web knowledge base for molecular biology</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>José Cruz-Toledo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alison Callahan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michel Dumontier</string-name>
          <email>michel_dumontier@carleton.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Biology, Carleton University</institution>
          ,
          <addr-line>Ottawa</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The discovery of the central role of genes in regulating the fundamental biochemical processes of living things has driven biologists to collect, analyze and re-use enormous amounts of information, and to make this information available in thousands of curated databases. The increasingly popular use of specialized terminologies, often organized into hierarchical taxonomies or more formal ontologies, to describe this data indicates that managing the total amount of resources available (big data) will continue to be a challenge. Here, we describe a biological gene-centric dataset (available at http://semanticscience.org/projects/gene-world) that aims to provide the reasoner community with a fully connected graph of data and ontologies of value to the bioinformatics community. Automated reasoning for consistency checking and query answering over such large ontology-mapped linked data currently poses significant challenges.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Web</kwd>
        <kwd>Bioinformatics</kwd>
        <kwd>DL reasoning</kwd>
        <kwd>SPARQL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The central dogma of molecular biology states that regions of DNA (genes) are
responsible for encoding molecular machines called proteins, which participate in and
control the biochemical reactions essential to sustaining life. The discovery of the
central role of genes as the blueprint of our evolutionary history and their involvement
in health and disease has driven biologists to characterize these very important
entities. Enormous amounts of information have been collected, analyzed, summarized
and re-published in thousands of curated databases [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and large central hubs such as
the databases of the National Center for Biotechnology Information (NCBI).
      </p>
      <p>
        The exponential growth of available molecular data clearly yields enormous
benefits to biologists attempting to elucidate the functioning of genes in related systems,
but it also presents significant challenges for modern biology. Consider that the
amount of data collected in this year alone to characterize a collection of biochemical
reactions (a pathway) will be on par with the amount of data that has ever been
collected about that pathway in the history of the field [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Moreover, the use of
specialized terminologies, often organized into hierarchical taxonomies or more formal
ontologies, indicates that managing the total amount of data (big data) will surely
continue to be an ongoing challenge. Indeed, it is the overall organization and
interpretation of this vast deluge of information that presents the greatest challenge.
      </p>
      <p>
        Motivated by this challenge, we present a preliminary version of a biological
gene-centric dataset aimed at providing the reasoner community with a fully connected
graph of data and ontologies of value to the bioinformatics community, for which
there currently exist significant challenges in using automated reasoning for
consistency checking and query answering of large ontology-mapped linked data. We
focus our attention on one of the larger datasets in the Bio2RDF project [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] - NCBI
Gene - and consider queries that extend from this dataset into other datasets and
ontologies that together form a large ‘Gene-World’ knowledge base. Our Gene-World
knowledge base contrasts with other ontologies and datasets that have been used to
benchmark OWL reasoners [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], such as LUBM [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and SNOMED-CT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], in several
respects: (i) Gene-World is a ‘real world’ knowledge base composed of existing
resources used by biologists and bioinformaticians on a daily basis, as opposed to an
arbitrary automatically generated knowledge base, (ii) it consists of a very large
T-Box and A-Box and (iii) its T-Box consists of multiple ontologies with differing DL
expressivity. The datasets and ontologies described are available at [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Example
queries that can be used to evaluate RDF/OWL based reasoner performance over this
knowledge base are also described.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Datasets and Ontologies</title>
      <p>
        All Gene-World datasets are drawn from Bio2RDF Release 2 (released January
2013). The NCBI Gene [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] Bio2RDF dataset consists of 394,026,267 triples with
12,543,449 unique subjects, 60 unique predicates, and 121,538,103 unique objects.
NCBI Gene describes genes including their names, reference sequences, variants,
phenotypes, pathways and cross-references to related resources. HomoloGene [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is a
database of programmatically generated clusters of homologous, including paralogous
and orthologous, genes from a set of 21 completely sequenced eukaryotic genomes.
The HomoloGene Bio2RDF dataset consists of 1,281,881 triples with 43,605 unique
subjects, 17 unique predicates and 1,011,783 unique objects, and uses NCBI Gene
identifiers to refer to the genes it clusters. NCBI Gene makes reference to three
ontologies: the Gene Ontology (GO) for asserting function, process or location annotations
about genes, the Evidence Code Ontology (ECO) for qualifying the source of these
GO annotations, and the NCBI Taxonomy (TAXON) for asserting the species of a gene.
The Semanticscience Integrated Ontology (SIO) and the Sequence Ontology (SO)
have been mapped to NCBI Gene Bio2RDF vocabulary classes and relations (Table
1) to ground the dataset types and predicates in domain-specific ontologies.
      </p>
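      <p>As an illustration of how summary statistics such as those above are derived, the following minimal Python sketch counts the triples and the unique subjects, predicates and objects in a small set of triples. The example triples and identifiers are hypothetical stand-ins and are not taken from the Bio2RDF release.</p>
      <preformat>
```python
from typing import Iterable, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def dataset_statistics(triples: Iterable[Triple]) -> dict:
    """Count triples and distinct subjects, predicates and objects."""
    subjects, predicates, objects = set(), set(), set()
    total = 0
    for t in triples:
        total += 1
        subjects.add(t.subject)
        predicates.add(t.predicate)
        objects.add(t.obj)
    return {
        "triples": total,
        "unique_subjects": len(subjects),
        "unique_predicates": len(predicates),
        "unique_objects": len(objects),
    }


# Hypothetical toy triples; the identifiers are illustrative only.
EXAMPLE_TRIPLES = [
    Triple("gene:1", "rdf:type", "vocab:Gene"),
    Triple("gene:1", "vocab:has_taxid", "taxon:9606"),
    Triple("gene:2", "rdf:type", "vocab:Gene"),
    Triple("gene:2", "vocab:has_taxid", "taxon:10090"),
]

stats = dataset_statistics(EXAMPLE_TRIPLES)
```
      </preformat>
      <p>At the scale of the NCBI Gene dataset, holding every distinct term in memory may be impractical, so in practice such counts would more likely be computed by the triple store itself.</p>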
      <p>
        The Gene Ontology (GO) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is a hierarchy of controlled biological terms
that is organized into three orthogonal ontologies which capture knowledge about
cellular locations, biological processes and molecular functions. The terms and
relations contained in GO are serialized as a directed acyclic graph where concepts are
organized into a hierarchy in which more specific GO terms are subsumed by more
general terms by following ‘is a’ or, in some cases, ‘part of’ relationships. The Evidence
Code Ontology (ECO) is a controlled vocabulary used for describing the scientific
evidence that supports an assertion. ECO’s 290+ terms include descriptions of
laboratory experiments, computational methods and literature annotation terminology. The
NCBI Taxonomy (TAXON) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is a database of taxonomic lineage obtained from a
variety of sources, including primary literature, external databases and expert human
curation efforts for databases hosted by the NCBI. The Sequence Ontology (SO) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
describes a rich set of features and attributes of biological sequences. The terms and
relations included in this ontology characterize both physical attributes of biological
sequences (e.g. binding sites, exons) and the processes in which biological sequences
may be involved (e.g. translational frameshifts, transitions, deletions). The
Semanticscience Integrated Ontology (SIO) provides a basic set of types and relations
for describing objects, processes and attributes of biological entities. Fig. 1 shows
how these ontologies are used within the Gene dataset, or are linked to the Gene
dataset by virtue of mappings to SIO.
      </p>
      <p>
        In this section, we describe reasoning tasks over the Gene-World knowledge base that
can be used to benchmark the performance of an OWL reasoner or SPARQL query
system. After loading all the triples for the NCBI Gene and HomoloGene RDF
datasets, as well as all ontologies listed in Table 1, the first benchmark task for an OWL
reasoner would be to check the consistency of the combined knowledge base. While
each component is not expected to contain any unsatisfiable classes, mappings
between SIO and Gene, SO and Gene or the disjoint class axioms for the NCBI
taxonomy ontology may lead to class or property unsatisfiability.
      </p>
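      <p>To illustrate how mappings combined with disjointness axioms can produce an unsatisfiable class, the minimal Python sketch below flags any class that inherits from two classes declared disjoint. The class names are hypothetical, and the check covers only asserted subclass and disjointness axioms; it is not a substitute for an OWL reasoner.</p>
      <preformat>
```python
from itertools import combinations


def ancestors(cls, subclass_of):
    """Return cls together with all of its direct and indirect superclasses."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, ()))
    return seen


def unsatisfiable_classes(subclass_of, disjoint_pairs):
    """Find classes that inherit from two classes declared disjoint."""
    disjoint = {frozenset(pair) for pair in disjoint_pairs}
    classes = set(subclass_of)
    for parents in subclass_of.values():
        classes.update(parents)
    result = set()
    for cls in classes:
        for a, b in combinations(sorted(ancestors(cls, subclass_of)), 2):
            if frozenset((a, b)) in disjoint:
                result.add(cls)
                break
    return result


# Hypothetical axioms: a dataset class is mapped to two upper-level
# classes that the upper-level ontology declares disjoint.
SUBCLASS_OF = {
    "dataset:Gene": ["upper:Object", "upper:Process"],
    "dataset:Pathway": ["upper:Process"],
}
DISJOINT = [("upper:Object", "upper:Process")]

unsat = unsatisfiable_classes(SUBCLASS_OF, DISJOINT)
```
      </preformat>
      <p>In this toy example only the class mapped to both disjoint upper-level classes is reported, which mirrors the kind of error that a mapping between two independently developed ontologies may introduce.</p>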
      <p>
        Below, we present a set of DL and SPARQL-DL queries over the combined
knowledge base that may not give the complete set of results without reasoning
support for some portion of OWL 2 DL (there are no nominals in the knowledge base).
GitHub Gists of all queries are available at [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>Fig. 1. Links between a gene in the NCBI Gene dataset and annotations of its function,
associated cellular components and/or processes. Functions, cellular components, and processes are
described using the Gene Ontology (GO, in green), while the associated evidence type for an
association is described using the Evidence Code Ontology (ECO, in orange). The taxonomic
group for a gene is described using NCBI Taxonomy (in blue). HomoloGene (in purple) groups
related genes and taxa. Each part of an NCBI Gene record is mapped to the Semanticscience
Integrated Ontology (SIO, in yellow), which also has mappings to the Sequence Ontology (SO,
in pink). Ellipses represent resources. Boxes represent ontology classes.
</p>
      <sec id="sec-2-1">
        <title>Query answering</title>
        <sec id="sec-2-1-1">
          <title>Q1: retrieve transfer RNA genes</title>
          <p>DL query: tRNA-gene</p>
          <p>A simple query that retrieves a type assertion in the NCBI Gene data.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Q2: retrieve human genes</title>
          <p>DL query: gene that has_taxid some ‘Homo sapiens [taxid:9606]’</p>
          <p>A conjunctive query over NCBI Gene and the NCBI Taxonomy.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Q3: retrieve genes that are from any mammal but human</title>
          <p>DL query: gene that has_taxid some (‘Mammalia [taxid: 40674]’ and not ‘Homo sapiens [taxid:9606]’)</p>
          <p>A conjunctive query with negation, subclass reasoning over the asserted hierarchy, and class and relation mappings to an upper-level ontology.</p>
        </sec>
        <sec id="sec-2-1-4">
          <title>Q4: retrieve genes that are annotated with a specific enzymatic function</title>
          <p>DL query: gene that ‘has function’ some ‘acetylglucosaminyltransferase activity [go:0008375]’</p>
          <p>A simple conjunctive query with subclass reasoning.</p>
        </sec>
        <sec id="sec-2-1-5">
          <title>Q5: retrieve genes that are annotated with a specific function that was not inferred by computational analysis</title>
          <p>DL query: gene that ‘has function’ some function that inverse(go_term) some (‘has evidence’ some (not ‘inferred from electronic annotation’))</p>
          <p>A conjunctive query using negation, mappings and an inverse property.</p>
        </sec>
        <sec id="sec-2-1-6">
          <title>Q6: retrieve organisms that have genes with an enzymatic activity that was not obtained by computational analysis</title>
          <p>DL query: ‘Mammalia [taxid: 40674]’ that inverse(has_taxid) some (gene that ‘has function’ some (function that inverse(go_term) some (‘has evidence’ some (not ‘inferred from electronic annotation’))))</p>
          <p>A conjunctive query with negation, inverse properties and mappings.</p>
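          <p>Several of the queries above, such as Q2, Q3 and Q6, depend on subclass reasoning over the NCBI Taxonomy, and Q3 additionally needs disjointness axioms to prove that a taxon is not human. The minimal Python sketch below mimics this for Q3 over a hypothetical toy taxonomy; the gene identifiers, the abbreviated lineage and the disjointness axiom are illustrative assumptions, and a real evaluation would rely on an OWL reasoner.</p>
          <preformat>
```python
def superclasses(cls, parent_of):
    """Return cls and all of its ancestors in a single-parent hierarchy."""
    chain = [cls]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain


def provably_disjoint(cls, other, parent_of, disjoint_pairs):
    """True if an ancestor of cls is declared disjoint with an ancestor of other."""
    declared = {frozenset(pair) for pair in disjoint_pairs}
    return any(
        frozenset((a, b)) in declared
        for a in superclasses(cls, parent_of)
        for b in superclasses(other, parent_of)
    )


def mammal_but_not_human_genes(gene_taxon, parent_of, disjoint_pairs):
    """Toy version of Q3: genes whose taxon is a mammal and provably not human."""
    return sorted(
        gene
        for gene, taxon in gene_taxon.items()
        if "Mammalia" in superclasses(taxon, parent_of)
        and provably_disjoint(taxon, "Homo sapiens", parent_of, disjoint_pairs)
    )


# Hypothetical, heavily abbreviated lineage (child -> parent).
PARENT_OF = {
    "Homo sapiens": "Primates",
    "Mus musculus": "Rodentia",
    "Primates": "Mammalia",
    "Rodentia": "Mammalia",
    "Danio rerio": "Actinopterygii",
}
# Assumed disjointness axiom between sibling taxa.
DISJOINT = [("Primates", "Rodentia")]
# Hypothetical gene-to-taxon assertions.
GENE_TAXON = {
    "gene:A": "Homo sapiens",
    "gene:B": "Mus musculus",
    "gene:C": "Danio rerio",
}

q3_answers = mammal_but_not_human_genes(GENE_TAXON, PARENT_OF, DISJOINT)
```
          </preformat>
          <p>Without the disjointness axiom the mouse gene would not be returned, because under the open-world assumption the absence of a ‘Homo sapiens’ assertion does not by itself establish that a taxon is not human.</p>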
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Querying using class axioms</title>
        <p>All of the ontologies listed in Table 1 have rich class hierarchies. SIO and the
Sequence Ontology (SO) also have axiomatic class definitions. DL queries can thus
leverage the axioms used to define classes, as well as the class hierarchy.</p>
        <sec id="sec-2-2-1">
          <title>Q7: retrieve a gene that encodes a certain kind of molecule using SIO</title>
          <p>DL query: gene and (encodes some ‘small cytoplasmic RNA (scRNA)’)</p>
          <p>Reasoning with subclass axioms from a mapped ontology.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Q8: retrieve a gene that encodes a certain kind of molecule using SO</title>
          <p>DL query: gene and (has_quality scRNA_encoding)</p>
          <p>Reasoning with subclass axioms from a mapped ontology.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>SPARQL-DL queries</title>
        <p>
          SPARQL-DL [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] is a subset of SPARQL that allows the formulation of queries using a
combination of OWL semantics and SPARQL variables. SPARQL-DL is particularly
useful in cases where one wishes to retrieve instances that are linked to some other
resource while also taking advantage of DL reasoning. This is possible by using SPARQL
variable bindings.
        </p>
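        <p>The role of variable bindings can be sketched with a small conjunctive query matcher in Python. The atoms mirror the Type and PropertyValue atoms of SPARQL-DL, but the facts, the identifiers and the has_member property are hypothetical, and no DL reasoning is performed; a SPARQL-DL engine would evaluate the same atoms against inferred as well as asserted facts.</p>
        <preformat>
```python
def is_variable(term):
    """SPARQL-style variables start with a question mark."""
    return term.startswith("?")


def match_atom(atom, fact, binding):
    """Try to extend a binding so that the atom matches the fact."""
    if len(atom) != len(fact):
        return None
    extended = dict(binding)
    for term, value in zip(atom, fact):
        if is_variable(term):
            # Bind the variable, or reject if it is already bound elsewhere.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def answer(atoms, facts):
    """Return every variable binding that satisfies all atoms."""
    bindings = [{}]
    for atom in atoms:
        bindings = [
            extended
            for binding in bindings
            for fact in facts
            for extended in [match_atom(atom, fact, binding)]
            if extended is not None
        ]
    return bindings


# Hypothetical facts: Type atoms have three terms, PropertyValue atoms four.
FACTS = [
    ("Type", "gene:h1", "gene"),
    ("Type", "gene:m1", "gene"),
    ("Type", "group:1", "HomoloGene_Group"),
    ("PropertyValue", "gene:h1", "has_taxid", "Homo sapiens"),
    ("PropertyValue", "gene:m1", "has_taxid", "Mus musculus"),
    ("PropertyValue", "group:1", "has_member", "gene:h1"),
    ("PropertyValue", "group:1", "has_member", "gene:m1"),
]

# has_member is an assumed linking property, not taken from the dataset.
QUERY = [
    ("Type", "?group", "HomoloGene_Group"),
    ("PropertyValue", "?group", "has_member", "?human"),
    ("PropertyValue", "?group", "has_member", "?mouse"),
    ("PropertyValue", "?human", "has_taxid", "Homo sapiens"),
    ("PropertyValue", "?mouse", "has_taxid", "Mus musculus"),
]

results = answer(QUERY, FACTS)
```
        </preformat>
        <p>Because the group variable is shared between atoms, only gene pairs that belong to the same group are returned; this is the linking behaviour that a single DL class expression cannot express.</p>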
        <sec id="sec-2-3-1">
          <title>Q9: retrieve orthologous human and mouse genes annotated with the function to bind ATP</title>
          <p>Type(?human_gene, ‘gene’), Type(?mouse_gene, ‘gene’), Type(?homologene_group,
HomoloGene_Group), PropertyValue(?human_gene, has_taxid, ‘Homo sapiens’),
PropertyValue(?mouse_gene, has_taxid, ‘Mus musculus’),
PropertyValue(?human_gene, ‘has function’, ‘ATP binding’),
[...] ‘ATP binding’), [...] ?human_gene), [...]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Summary</title>
      <p>We have described Gene-World, a large gene-centric knowledge base consisting of
Bio2RDF datasets with over 395 million statements linked to five bio-ontologies with
varying degrees of DL expressivity. The size and complexity of this dataset in
addition to the provided DL and SPARQL-DL queries may provide a useful benchmark
against which to evaluate OWL reasoner capability and efficiency for life science
datasets. Should this preliminary knowledge base become useful in reasoner
evaluation, we expect to extend it to include more of the 20+ datasets and hundreds of
ontologies in Bio2RDF.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Fernandez-Suarez</surname>
            <given-names>XM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Galperin</surname>
            <given-names>MY</given-names>
          </string-name>
          :
          <article-title>The 2013 Nucleic Acids Research Database Issue and the online molecular biology database collection</article-title>
          .
          <source>Nucleic Acids Res</source>
          <year>2013</year>
          ,
          <volume>41</volume>
          (Database issue):
          <fpage>D1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chuang</surname>
            <given-names>HY</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hofree</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ideker</surname>
            <given-names>T</given-names>
          </string-name>
          :
          <article-title>A decade of systems biology</article-title>
          .
          <source>Annu Rev Cell Dev Biol</source>
          <year>2010</year>
          ,
          <volume>26</volume>
          :
          <fpage>721</fpage>
          -
          <lpage>744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Callahan</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cruz-Toledo</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumontier</surname>
            <given-names>M</given-names>
          </string-name>
          :
          <article-title>Ontology-Based Querying with Bio2RDF's Linked Open Data</article-title>
          .
          <source>Journal of Biomedical Semantics</source>
          <year>2013</year>
          ,
          <volume>4</volume>
          (Supplement 1):S1.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Dentler</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cornet</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>ten Teije</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Keizer</surname>
            <given-names>N</given-names>
          </string-name>
          :
          <article-title>Comparison of reasoners for large ontologies in the OWL 2 EL profile</article-title>
          .
          <source>Semantic Web</source>
          <year>2011</year>
          ,
          <volume>2</volume>
          (
          <issue>2</issue>
          ):
          <fpage>71</fpage>
          -
          <lpage>87</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Guo</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heflin</surname>
            <given-names>J</given-names>
          </string-name>
          :
          <article-title>LUBM: A benchmark for OWL knowledge base systems</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          <year>2005</year>
          ,
          <volume>3</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>158</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Stearns</surname>
            <given-names>MQ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Price</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spackman</surname>
            <given-names>KA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>AY</given-names>
          </string-name>
          :
          <article-title>SNOMED clinical terms: overview of the development process and project status</article-title>
          .
          <source>Proc AMIA Symp</source>
          <year>2001</year>
          :
          <fpage>662</fpage>
          -
          <lpage>666</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Gene-World:
          <article-title>A large-scale gene-centric semantic web knowledge base for molecular biology</article-title>
          [http://semanticscience.org/projects/gene-world]
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Maglott</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ostell</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pruitt</surname>
            <given-names>KD</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tatusova</surname>
            <given-names>T</given-names>
          </string-name>
          :
          <article-title>Entrez Gene: gene-centered information at NCBI</article-title>
          .
          <source>Nucleic Acids Res</source>
          <year>2011</year>
          ,
          <volume>39</volume>
          (Database issue):
          <fpage>D52</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <article-title>Database resources of the National Center for Biotechnology Information</article-title>
          .
          <source>Nucleic Acids Res</source>
          <year>2013</year>
          ,
          <volume>41</volume>
          (Database issue):
          <fpage>D8</fpage>
          -
          <lpage>D20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ashburner</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ball</surname>
            <given-names>CA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blake</surname>
            <given-names>JA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Botstein</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butler</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cherry</surname>
            <given-names>JM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            <given-names>AP</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dolinski</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dwight</surname>
            <given-names>SS</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eppig</surname>
            <given-names>JT</given-names>
          </string-name>
          et al:
          <article-title>Gene ontology: tool for the unification of biology</article-title>
          .
          The Gene Ontology Consortium.
          <source>Nat Genet</source>
          <year>2000</year>
          ,
          <volume>25</volume>
          (
          <issue>1</issue>
          ):
          <fpage>25</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <article-title>OWL Export of GO DATABASE DAILY TERMDB</article-title>
          [http://archive.geneontology.org/latest-termdb/go_daily-termdb.owl.gz]
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Eilbeck</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            <given-names>SE</given-names>
          </string-name>
          :
          <article-title>Sequence ontology annotation guide</article-title>
          .
          <source>Comp Funct Genomics</source>
          <year>2004</year>
          ,
          <volume>5</volume>
          (
          <issue>8</issue>
          ):
          <fpage>642</fpage>
          -
          <lpage>647</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <article-title>Gene-World DL and SPARQL-DL Queries</article-title>
          [http://semanticscience.org/projects/geneworld/gene-world-query-gists.html]
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Sirin</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            <given-names>B</given-names>
          </string-name>
          :
          <article-title>SPARQL-DL: SPARQL Query for OWL-DL</article-title>
          .
          In:
          <source>3rd OWL Experiences and Directions Workshop (OWLED-2007)</source>
          . Innsbruck, Austria;
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>