<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SADI for GMOD: Semantic Web Services for Model Organism Databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ben Vandervalk</string-name>
          <email>ben.vvalk@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michel Dumontier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>E Luke McCarthy</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark D Wilkinson</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Biology, Carleton University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>James Hogg Research Centre, Heart + Lung Institute, University of British Columbia</institution>
        </aff>
      </contrib-group>
      <fpage>70</fpage>
      <lpage>75</lpage>
      <abstract>
        <p>Here we describe work-in-progress on the SADI for GMOD project (SADI: Semantic Automated Discovery and Integration; GMOD: Generic Model Organism Database), a distribution of ready-made Web services that will bring additional model organism data onto the Semantic Web. SADI is a lightweight standard for implementing Web services that natively consume and generate RDF, while GMOD is a widely-used toolkit for building model organism databases (e.g. FlyBase, ParameciumDB). The SADI for GMOD services will provide a novel mechanism for analyzing data across GMOD sites, as well as other bioinformatics resources that publish their data using SADI.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Web</kwd>
        <kwd>Web services</kwd>
        <kwd>SADI</kwd>
        <kwd>GMOD</kwd>
        <kwd>model organism databases</kwd>
        <kwd>bioinformatics</kwd>
        <kwd>sequence features</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        One of the most pervasive problems in bioinformatics is the integration of data
and software across research labs. While the prevailing method of sharing data is
through centrally controlled repositories such as GenBank [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], manual curation
of submissions imposes a bottleneck on the quantity and types of data that
can be integrated. Centralization also limits the types of
visualization and analysis tools that can readily be used with the data.
      </p>
      <p>
        One prominent example of a system for integrating distributed biological
data is the Distributed Annotation System (DAS) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A DAS server provides
access to sequence annotations (also known as sequence features) via a
RESTful [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] interface, and returns the annotations in a simple, standardized XML
format. Client applications (e.g. genome browsers) that understand the DAS
protocol and XML format are able to provide users with a unified view of
sequence annotations from multiple sites. Nevertheless, DAS has its limitations.
The XML datasets returned by DAS servers cannot be integrated without
specialized software, and cannot be readily combined with other types of data (e.g.
protein-protein interaction networks). In addition, the majority of
bioinformatics analysis tools (e.g. BLAST) do not natively understand DAS, and thus they
require specialized conversion scripts in order to process data from DAS servers.
      </p>
      <p>In this paper we describe work-in-progress on SADI for GMOD, a collection
of Semantic Web services that implement DAS-like functionality. The goal of
SADI for GMOD is to provide a more general solution for federating sequence
data that is compatible with the Semantic Web, and which facilitates automated
integration with analysis software and other types of bioinformatics data.
Toward this goal, we propose a standard model for representing sequence features
in RDF/OWL. The services are implemented according to the SADI
(Semantic Automated Discovery and Integration) standard, and are targeted toward
maintainers of GMOD (Generic Model Organism Database) sites. Additional
information about these two projects is provided in the following section.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Projects</title>
      <p>
        <bold>SADI (Semantic Automated Discovery and Integration).</bold> SADI [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a
lightweight standard for the implementation of Semantic Web services.
Services adhering to the SADI recommendations natively consume and
generate data in RDF form, and can be invoked by issuing an HTTP POST
to the service URL with an input RDF document as the payload. One of
the principal strengths of SADI is that there are no specialized protocols
or messaging formats. The interface to each service (that is, the expected
structure of the input and output RDF documents) is described by means
of a provider-specified input OWL class and output OWL class, respectively.
      </p>
      <p>
        Further details about SADI are given in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
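      <p>As a rough illustration of the invocation pattern described above, the following Python sketch builds (but does not send) an HTTP POST request whose payload is an RDF document. The service URL, the input class URI, and the choice of Turtle as the payload syntax are assumptions made for illustration only; they are not taken from the SADI for GMOD distribution.</p>
      <preformat><![CDATA[
```python
import urllib.request

# Hypothetical service endpoint; a real SADI service URL would be
# published by the service provider.
SERVICE_URL = "http://example.org/cgi-bin/SADI/get_feature_info"

# Hypothetical input RDF document (Turtle) containing one input node.
# The input class URI is a placeholder, not a real SADI for GMOD class.
INPUT_RDF = """\
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://lsrn.org/FLYBASE:FBgn0011935>
    rdf:type <http://example.org/ontology#InputClass> .
"""


def build_sadi_request(service_url, rdf_document, media_type="text/turtle"):
    """Build an HTTP POST request whose payload is an RDF document.

    The media type is an assumption; a given service may expect a
    different RDF serialization.
    """
    return urllib.request.Request(
        service_url,
        data=rdf_document.encode("utf-8"),
        headers={"Content-Type": media_type, "Accept": media_type},
        method="POST",
    )


request = build_sadi_request(SERVICE_URL, INPUT_RDF)
print(request.get_method(), request.full_url)

# To invoke a live service (not done here, since the URL is a placeholder):
#   with urllib.request.urlopen(request) as response:
#       output_rdf = response.read().decode("utf-8")
```
]]></preformat>
      <p>The request is constructed separately from the network call so that the payload and headers can be inspected before anything is sent.</p>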
      <p>
        <bold>GMOD (Generic Model Organism Database).</bold> The GMOD project [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is
a popular collection of open source software which facilitates the
construction of a model organism database and its associated website. The central
component of GMOD is a database schema called Chado [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which houses
a variety of datatypes such as sequences, sequence features, controlled
vocabularies, and gene expression data. Scripts are provided for creating and
loading a Chado instance as a Postgres database.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Services</title>
      <p>
        SADI for GMOD consists of five services which provide fundamental operations
for accessing sequence feature data, as shown in Table 1. A sequence feature is
an annotated region of a biological sequence (DNA, RNA, or amino acid) such
as a gene, an exon, or a protein domain. Related features are accessible through
a hierarchy of parent-child relationships, and the GMOD wiki provides a set of
recommendations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] indicating where particular feature types should be located
in the hierarchy. For example, the GMOD conventions assert that a gene should
be a child feature of a chromosome and that an mRNA transcript should be a
child feature of a gene. The relationship connecting the parent and child feature
will be either "has part" or "derives into", depending on whether the features
are spatially or temporally related. For instance, the relationship between a
chromosome and a gene is "has part", whereas the relationship between a gene
and a transcript is "derives into".
      </p>
      <p>
        The implementation of the SADI for GMOD services is relatively
straightforward. The main point of interest is how the data is modeled in RDF/OWL. The
entities that need to be modeled are feature descriptions, genomic coordinates,
and database identifiers, as shown in Table 2.
      </p>
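      <p>The parent-child conventions described in this section can be sketched as a small lookup table. The Python fragment below is only an illustration: it encodes the two example relationships mentioned in the text, and the function names and the use of plain strings for feature types are assumptions rather than part of the SADI for GMOD code.</p>
      <preformat><![CDATA[
```python
# Relationship between (parent type, child type) pairs, restricted to the
# two examples given in the text. Other pairs are deliberately left out.
RELATIONSHIPS = {
    ("chromosome", "gene"): "has part",   # spatial relationship
    ("gene", "mRNA"): "derives into",     # temporal relationship
}


def relationship(parent_type, child_type):
    """Return the parent-child relationship, or None if it is not listed."""
    return RELATIONSHIPS.get((parent_type, child_type))


def describe_path(feature_types):
    """Describe each parent-child step along a chain of feature types."""
    steps = []
    for parent, child in zip(feature_types, feature_types[1:]):
        steps.append(f"{parent} {relationship(parent, child)} {child}")
    return steps


path = describe_path(["chromosome", "gene", "mRNA"])
print("\n".join(path))
```
]]></preformat>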
      <p>
        In Listing 1, we show an example feature description for a tRNA gene in
Drosophila melanogaster, encoded in Turtle format. The principal ontology
used for the encoding is SIO (Semanticscience Integrated Ontology) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which
provides a large collection of properties for capturing mereological, temporal,
and other types of relationships. In addition, features are typed using terms
from the Sequence Ontology [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Some readers may initially balk at the apparent
complexity and opacity of Listing 1; however, the primary goal of the encoding
is to facilitate automatic integration of data, whereas simplicity and
human-readability are secondary considerations. Several data modeling
practices, once understood, should help to clarify Listing 1:
      </p>
      <list list-type="order">
        <list-item>
          <p><bold>Distinct entities are always modeled as distinct nodes in the graph.</bold>
In non-RDF formats (e.g. relational databases), it is easy to conflate related
entities. For example, the sequence of a chromosome and the chromosome
itself are often thought of as the same entity. However, this is not precisely
true; the sequence is an abstract string representation of one of the strands
of the chromosome. In order to facilitate accurate and automated processing
of the data, it is often helpful to make such distinctions explicit. In Listing
1, the tRNA gene has a ranged sequence position in relation to a sequence
that represents the minus strand of a chromosome.</p>
        </list-item>
        <list-item>
          <p><bold>URIs are frequently opaque.</bold> Ontology providers (e.g. OBI, GO, SO)
assign numeric URIs to classes and relationships in their ontologies for two
reasons: i) the URIs can have labels in multiple languages, and ii) the labels
can be updated without requiring updates to dependent datasets.</p>
        </list-item>
        <list-item>
          <p><bold>Literals are modeled as typed resources.</bold> The simplest way to represent
literals in RDF is as plain strings or numbers, with the type of the literal
indicated by the XSD datatype (e.g. xsd:float). Here, literals are instead modeled
as instances of a particular rdf:type (e.g. range:StartPosition), with the
actual values being specified by the "has value" property (i.e. SIO_000300).
This approach provides a more flexible typing mechanism and allows
additional information such as provenance to be attached to the values.</p>
        </list-item>
        <list-item>
          <p><bold>Database identifiers are modeled as typed string values.</bold> In Listing
1, the feature URI http://lsrn.org/FLYBASE:FBgn0011935 has an attached
identifier with an rdf:type of lsrn:FLYBASE_Identifier and a value of
"FBgn0011935". This may seem redundant, as the URI already acts as a
unique identifier for the feature. We have adopted the practice of
attaching typed, string-encoded database identifiers to URIs in order to address
a common problem on the Semantic Web, namely the tendency of data
providers to invent their own URI schemes. For example, the URI for UniProt
protein P04637 is alternatively represented on the Semantic Web as
http://purl.uniprot.org/uniprot/P04637 (UniProt and LinkedLifeData),
http://bio2rdf.org/uniprot:P04637 (Bio2RDF and Linked Open Drug Data), and
http://lsrn.org/UniProt:P04637 (SADI). While the existence of multiple
URIs for the same entity impedes data integration across sites, data providers
often create their own URI schemes so that the URIs will resolve to datasets
or webpages on their own sites. We propose attaching database identifiers to
URIs as shown here, so that equivalent URIs can automatically be reconciled
across sites, while still allowing the URIs created by each provider to resolve
to their own data.</p>
        </list-item>
      </list>
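      <p>The reconciliation idea behind the fourth practice can be sketched in a few lines of Python. The fragment below assumes that the typed identifier attached to each URI has already been extracted into simple (URI, identifier type, identifier value) records; the record layout, the function name, and the identifier type name lsrn:UniProt_Identifier are hypothetical and are used here only for illustration.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# (URI, identifier type, identifier value) records, as might be extracted
# from the typed identifier nodes attached to each URI. The type name
# "lsrn:UniProt_Identifier" is a hypothetical illustration.
records = [
    ("http://purl.uniprot.org/uniprot/P04637", "lsrn:UniProt_Identifier", "P04637"),
    ("http://bio2rdf.org/uniprot:P04637", "lsrn:UniProt_Identifier", "P04637"),
    ("http://lsrn.org/UniProt:P04637", "lsrn:UniProt_Identifier", "P04637"),
    ("http://lsrn.org/FLYBASE:FBgn0011935", "lsrn:FLYBASE_Identifier", "FBgn0011935"),
]


def reconcile(records):
    """Group URIs that share the same (identifier type, value) pair."""
    groups = defaultdict(set)
    for uri, id_type, id_value in records:
        groups[(id_type, id_value)].add(uri)
    return dict(groups)


equivalent = reconcile(records)
for key, uris in sorted(equivalent.items()):
    print(key, sorted(uris))
```
]]></preformat>
      <p>Each provider's URI is left untouched in this sketch; equivalence is derived only from the attached identifier, so every URI can still resolve to its provider's own data.</p>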
      <p>Listing 1. Example RDF encoding for a tRNA gene in Drosophila melanogaster.</p>
      <list list-type="order">
        <list-item>
          <p>... a Bio::DB::SeqFeature::Store database, which must be loaded separately
by the GMOD site maintainer. The most common scenario is to load the data
from a set of GFF files into a MySQL database; Bio::DB::SeqFeature::Store
provides the bp_seqfeature_load.pl script for this purpose.</p>
        </list-item>
        <list-item>
          <p><bold>Unpack the SADI for GMOD tarball in the cgi-bin directory.</bold> The
tarball will be unpacked into a SADI directory tree which will contain the
Perl CGI scripts as well as the required Perl modules.</p>
        </list-item>
        <list-item>
          <p><bold>Add database connection parameters to the SADI for GMOD
configuration file.</bold> The configuration file will be located in the SADI
subdirectory of cgi-bin.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>While the majority of existing biological Web services use XML for data
exchange, SADI services use RDF/OWL in order to facilitate automatic
integration of data across service providers. As such, the SADI for GMOD services will
provide a novel tool for conducting analyses across model organism databases,
as well as other biological data sources and tools that are published using SADI.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>Initial development of SADI and SHARE has been funded by a special
initiatives award from the Heart and Stroke Foundation of British Columbia and
Yukon, with additional funding from Microsoft Research and an operating grant
from the Canadian Institutes of Health Research (CIHR). In addition, core
laboratory funding has been supplied by the Natural Sciences and
Engineering Research Council of Canada (NSERC). Development of SADI for GMOD,
as well as hundreds of other SADI services, has been funded by a grant from
Canada's Advanced Research and Innovation Network (CANARIE).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Wilkinson</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vandervalk</surname>
            ,
            <given-names>B.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCarthy</surname>
            ,
            <given-names>E.L.</given-names>
          </string-name>
          :
          <article-title>SADI Semantic Web Services - 'cause you can't always GET what you want!</article-title>
          <source>Services Computing Conference (APSCC)</source>
          <year>2009</year>
          ,
          <fpage>13</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. GMOD homepage, http://gmod.org</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Introduction to Chado, GMOD Wiki, http://gmod.org/wiki/Introduction_to_Chado</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Semantic Science on Google Code, http://code.google.com/p/semanticscience/</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Eilbeck</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mungall</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          , et al.:
          <article-title>The Sequence Ontology: a tool for the unification of genome annotations</article-title>
          .
          <source>Genome Biology</source>
          <volume>6</volume>
          :
          <issue>5</issue>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Benson</surname>
            ,
            <given-names>D.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karsch-Mizrachi</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lipman</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          , et al.:
          <article-title>GenBank</article-title>
          .
          <source>Nucleic Acids Research</source>
          <volume>36</volume>
          ,
          <fpage>D25</fpage>
          -
          <lpage>D30</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dowell</surname>
            ,
            <given-names>R.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jokerst</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Day</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>The Distributed Annotation System</article-title>
          .
          <source>BMC Bioinformatics</source>
          <volume>2</volume>
          :7 (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fielding</surname>
            ,
            <given-names>R.T.</given-names>
          </string-name>
          :
          <article-title>Architectural styles and the design of network-based software architectures</article-title>
          . University of California, Irvine (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>