<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Data Integration for Francisella tularensis novicida Proteomic and Genomic Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nadia Anwar</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ela Hunt</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Walter Kolch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew Pitt</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Beatson Institute for Cancer Research Glasgow</institution>
          ,
          <addr-line>G61 1BD</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer and Information Sciences, University of Strathclyde</institution>
          ,
          <addr-line>Glasgow G1 1XB</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Division of Integrative and Systems Biology, Faculty of Biomedical and Life Sciences, University of Glasgow</institution>
          ,
          <addr-line>Glasgow G12 8QQ</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper summarises the lessons and experiences gained from a case study of the application of semantic web technologies to the integration of data from the bacterial species Francisella tularensis novicida (Fn). Fn data sources are disparate and heterogeneous, as multiple laboratories across the world, using multiple technologies, perform experiments to understand the mechanism of virulence. It is hard to integrate such data, and this work examines the role of explicitly provided data semantics in data integration. We test whether semantic web technologies can be used to reveal previously unknown connections across the available Fn datasets. We combined these data with genome data and with public domain annotations within GO, KEGG and the SUPERFAMILY database. Through this connected graph of database cross references, we extended the annotations of an experimental data set by superimposing the annotation graph onto it. Identifiers used in the experimental data resolved automatically, and the data acquired annotations from the rest of the RDF graph. This happened without the expensive manual annotation that would normally be required to produce these links. Other lessons learnt and future challenges that result from this work are also presented in detail.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The ultimate goal in biology is to understand how genes translate to phenotypes,
i.e., to understand the complex relationship between genes, messenger RNAs,
proteins and tissues. Current research methodologies study these components
individually through genomic, transcriptomic, proteomic and metabolomic
experiments. Since these technologies require specific expertise for data generation
and analysis, it is quite rare to find experiments that are performed using all of
these technologies on the same sample. Also, attempts at correlating data across
these technologies have so far been disappointing, indicating that measures of
transcript level in microarray experiments are not a good indicator of protein
abundance [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Yet, fundamentally, we know that genes, transcripts, proteins
and all the processes performed in cells form a complex system that requires
each of these and many other components to function collaboratively. There are
significant benefits to be gained by combining data across experiments and
individual technologies [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. It is a long term goal in biology to understand the
system as a whole, and, in the short term, combined access to these data can
corroborate the predictions and validate discoveries made by each technology.
      </p>
      <p>
        Post-genomic technologies such as transcriptomics [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and proteomics
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have strengths and weaknesses [
        <xref ref-type="bibr" rid="ref4 ref7">7, 4</xref>
        ]. Given the central dogma, in theory,
when these data are used together, false discovery rates inherent within those
weaknesses could be reduced. The addition of corroborating data can be used
to validate predictions that are on the edge of the statistical thresholds. For
example, in proteomics experiments, three or more peptide hits per protein
are used as a reliable threshold for identification [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and many proteins identified
by the presence of a single peptide hit are discarded. These singleton hits can
be supported with transcript abundance data, or with metabolomic data if the
peptide hit falls into a pathway where metabolites reliant on the protein of
interest have been identified. In another example, identification of quantitative
trait loci (QTLs) through linkage analysis can identify regions in the genome
attributed to a particular polygenic trait (disease phenotype). These regions
contain many genes, some of which are responsible for the trait and some of which
are not. Expression QTL (eQTL) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] analysis reduces the number of candidate
genes in a QTL interval through the addition of gene expression profiles [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. It
has successfully identified genes associated with complex human diseases such
as asthma [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Combining data from various technologies is very difficult and is very often
done by hand. The source data are produced and stored independently
of each other. Thus, gathering data on a particular pathway, organism or
disease generated from these technologies requires collecting and combining these
data into spreadsheets, or using database software. Once these data are gathered
from individual data sources, subsequent downstream analysis may be required,
such as statistical tests for clustering or correlating data, or specialist algorithms
that can compare the data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For these data to be used more effectively,
the data at each level of analysis need to be readily accessible and easily
combined. Data integration, however, is not trivial and requires resolving syntactic,
structural and semantic differences across the data sources. Syntactic
heterogeneity includes differences in the data models,
such as relational databases, object stores, XML stores, flat files or spreadsheets.
Structural differences lie in the data schemas that each source specifies and the
query languages that they support. Semantic differences are expressed in the
terminologies (vocabularies) they recognise. The methodologies that are employed
to overcome these problems have so far proved difficult to reproduce on
alternative data sets, and they remain difficult to maintain and automate.
Also, since database heterogeneity is unavoidable, and a single data model for
all biomedical data is neither probable nor possible, we require a mechanism to
integrate data in an automated, scalable and flexible way.
      </p>
      <p>
        In this paper a new solution to flexible data integration is examined.
Semantic integration based on RDF [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is tested on omics data generated
for the organism Francisella tularensis novicida (Fn). Fn is a Gram-negative
bacterium that causes the plague-like disease tularemia. In the most highly
virulent subspecies, Francisella tularensis tularensis (Ft), only 10-50 bacteria are
required to cause infection in humans. Ft is much more infectious than the
bacterium Bacillus anthracis (the cause of anthrax), which has been used as a biological weapon.
The ability to cause severe disease, the low infectious dose and the bioterrorism
concerns posed by this organism have led to the availability of increased research
funds [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. While highly infectious in mice, Francisella tularensis novicida (Fn)
strain U112 is a less virulent subspecies infecting only immunocompromised
humans. This has allowed this subspecies to be well studied in the laboratory. The
genomes of all four subspecies have been sequenced and compared [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Additionally, numerous transcriptomic and proteomic experiments have been performed
on Fn.
      </p>
      <p>
        Of particular interest are the Francisella pathogenicity island (FPI) and the
MglA transcriptional regulator. The FPI is a 30 kb region containing 16-19 genes
whose functions remain unknown but which are essential for growth within macrophage
cells. Macrophages are free-floating cells within the vascular system and are a
part of the innate immune response. Their role is to engulf and digest pathogens
in a phagolysosome, an organelle containing digestive enzymes. Normally these
cells are a hostile environment for pathogens such as Francisella. However,
Francisella is able to survive and replicate in macrophages by escaping the
phagolysosome into the macrophage cytosol, where it can replicate and ultimately
escape, causing cell death. Experimental evidence shows that escape from the
phagolysosome is reliant on genes encoded within the FPI [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        In addition to the FPI, research has focused on a spontaneous mutant that is
unable to disrupt the phagolysosome and replicate in the cytosol. The gene that
was disrupted in this mutant was named MglA (Macrophage growth locus A).
The product of this gene regulates the transcription of genes within the FPI and
approximately 90 other genes. In an attempt to understand how MglA controls
the transcription of virulence factors, many proteomic [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and transcriptomic
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] experiments have been performed.
      </p>
      <p>In the majority of published studies, experimental data sources are analysed
manually and data elements are manually linked to online data sources. Biologists
could perform more efficient analysis if available online data could
be easily integrated with experimental data. However, annotating every
experiment to the same extent as a genome is very rarely done due to time
constraints. Biologists are therefore working with only part of the picture. We
propose here a semantic data integration solution that would facilitate
integration of online Fn data sources with individual experimental data sets in a simple
and efficient manner.</p>
      <p>The rest of this paper proceeds as follows. In Section 2 we provide some
background on data integration methods and in Section 3 we broaden this with
selected biological applications of data integration. Section 4 presents our
solution which combines the graphs of linked data, and in Section 5 we discuss our
results and observations. Then we conclude.</p>
    </sec>
    <sec id="sec-2">
      <title>Data Integration</title>
      <p>
        The goal of a data integration system is to provide uniform access to a set of
heterogeneous data sources, and to free the user from the knowledge about how
data are structured at the sources and how they are to be reconciled in order to
answer queries. Data integration is most commonly achieved using one of three
approaches: application integration (mediation), database federation and data
warehousing [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        Application integration involves writing special purpose software agents [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]
that can query individual data sources via a single interface and then combine
and return the results to the user. Although such applications are
initially inexpensive and simple to build, the integration logic is coded into the
application itself, so these systems are notoriously fragile, expensive to
maintain and susceptible to changes in the underlying systems that are integrated.
Adding new data sources often requires the application to be completely
rewritten. Very little integration is actually achieved through this approach. The
data sources remain autonomous, queries are performed locally and the results
that are gathered are combined and returned to the user. Therefore, if analysis
or comparison of the data received is required, this needs to be coded into the
application. Portals offer another approach that is similar to application
integration [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Usually, portals use web services to facilitate cross-database queries
[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. In these systems a query is captured by a mediating script (wrapper) which
translates the query to the various data sources and returns the results to the
user. Portals usually collect the data but do not integrate them; rather, the data from
the different sources are displayed separately within the portal interface.
      </p>
      <p>The major advantage of mediation is that the application/portal delivers
up-to-date data. Each source is mapped and the query mechanism is coded
into a wrapper that is hidden from the user. The user accesses each source
through a uniform query interface. The disadvantage to this approach is that
only the queries supported by each individual system can be wrapped into the
application/portal.</p>
      <p>A more robust approach to data integration uses database federation (or
mediation carried out by the database engine). Database federation describes a
particular architecture where a relational database management system provides
uniform access to a number of heterogeneous data sources. The data sources are
federated, since they are linked together by the database management system.
Database federation is an effective approach to the integration of heterogeneous
data sources when the data cannot be materialised into a data warehouse.</p>
      <p>Data integration using a data warehouse approach, where data from the data
sources are physically combined into one structure, is a very mature solution.
The biggest drawback to developing a data warehouse is the scale of the
resource required to integrate source data, and such data integration is usually
performed piecemeal in data warehouses. Also, the integration performed by
data warehouses is rarely reusable between projects. Each new project,
therefore, has to perform its own data integration from scratch. Data warehouses are
notoriously difficult to build, expensive to maintain and inflexible to changes
in the questions that can be asked. This is largely because they require a copy
to be made of data from all of the underlying data sources in a synchronized
extraction, transformation and loading (ETL) process. Data not extracted into
the warehouse cannot be queried conveniently, and changing the data that are
selected involves considerable redesign work. This places a large upfront design
burden on the warehouse schema and the ETL process. Biological data
integration requires a more flexible technology that is amenable to the ever-changing
landscape of biological data.</p>
    </sec>
    <sec id="sec-3">
      <title>Biological Data Integration</title>
      <p>
        Initial solutions for interoperating across bioinformatics databases used
precomputed cross-references or Linkouts [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. These database cross references are
used in sequence databases to link to functional annotations within other databases.
For example, the EMBL nucleotide database cross-links to the protein sequence database
UniProt, protein function databases such as PROSITE and InterPro, protein
structure databases, enzyme and pathway databases and the literature database
PubMed. These links are based on identifiers and are calculated using sequence
analysis tools. Sequence databases deliver data to users via flat file downloads
and are indexed in systems such as SRS [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and Entrez [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Cross references
in the databases enable users to move seamlessly from one database to another.
However, the databases are linked together rather than integrated.
      </p>
      <p>
        The increased complexity of biological data and the analyses performed on
these data led to the development of more complex data integration solutions.
Application integration for the interoperation of data and applications became
the mechanism of choice when technologies such as CORBA became popular [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ].
There are also examples of federated systems, such as BioKleisli [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] which used
a query language to query and manipulate data that were maintained in different
formats and DiscoveryLink from IBM [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] which provides users with a virtual
database which can be accessed using SQL queries. Several data warehouse
solutions have also been described [28-30]. None of these integration systems can
be easily extended or adapted to alternative data sets. This is mostly due to
the underlying weaknesses of the technologies that were used to build the
systems. Biological integration is not a solved problem. As new technologies become
available, the bioinformatics community exploits these with varied success. For
example, semantic data integration is now in vogue [31-35] as it offers a
solution to data integration that is more flexible and powerful. The advantages of
semantic web technologies make them a very attractive alternative to traditional
integration. This research project has aimed to understand how semantic data
integration can be used effectively for biological data. A proof of concept
exercise was performed to integrate data sets from laboratories studying the bacterium
Francisella tularensis using multiple functional genomics technologies. Initially
we focused on integration as a means to extend data annotations.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Proof of Concept - Combining RDF data</title>
      <p>
        Rather than data integration in the traditional sense, where overlapping data
elements are resolved into one structure, genomic, transcriptomic and proteomic
data need to be linked together using a scaffold that represents their relatedness.
Semantic web technologies offer exactly this scaffold. Since genomics provides
data on genes, transcriptomic experiments provide data on the transcription of
genes (in particular tissues or under specific conditions), and proteomics provides
identified peptides, this is not a simple case of resolving different data types and
data formats. In this situation there are no common data elements between the
data sets. As we deal here with data relationships which do not involve equality
but different degrees of similarity or physical overlap, it is clear that traditional
integration methods cannot match these data in a simple manner. However,
since these data are mutually related, integration can be achieved by using
semantic web technologies. Data are combined in a standard data model using
RDF and RDF-S. An ontology then maps the relationships between the entities
within the RDF. The rich semantics within an ontology allow the definition
of detailed relationships between concepts, whereas a database schema defines
only the allowed structure of a set of relations. This makes it easier to merge
ontologies, or to map them to one another. Ontologies form just one layer in the
semantic web stack. The full benefits of data integration can be achieved when the
semantic web technologies are layered together. The use of unique identification
and XML exchange standards (see discussion) will greatly improve the level of
integration that can be achieved. The base technology, where data are combined,
is RDF, the Resource Description Framework [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>The basic tenet of RDF is that everything is a resource that can be connected
to other resources via properties [36]. The basic information unit is an RDF
statement. A statement comprises a triple: a subject, a property and an object.
A set of triples jointly forms a directed labelled graph that can in theory model
most, if not all, domain knowledge. As a graph, the RDF model is oblivious to
both syntax and semantics, which makes it ideal for combining data.
RDF-S is the vocabulary definition
language for RDF. The inter-relationships between the properties and objects
in RDF are defined in RDF-S. RDF-S provides a logic, allowing inferences to
be made on the RDF graph. Using the logic defined in RDF-S and derivation
rules, new statements can be derived from existing statements. Figure 1 gives
the RDF graph for one gene in the Fn genome.</p>
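      <p>The triple model can be illustrated with a minimal Python sketch. It holds a graph as a set of (subject, property, object) tuples, merges graphs by set union, and applies one RDF-S style derivation rule in a single pass. This is not the Sesame implementation used in this work. The gene URI pattern, the comment text and the sub-property declaration are hypothetical and chosen only for illustration; only the locus tag FTN_0209 and the IMG gene identifier 639752258 are taken from the Fn data.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Sketch only: an RDF graph modelled as a set of (subject, property, object) triples.
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
IMG_S = "http://img.jgi.doe.gov/schema#"
# Hypothetical URI pattern for an IMG gene resource (assumption, not from the paper).
GENE = "http://img.jgi.doe.gov/gene/639752258"

genome_graph = {(GENE, IMG_S + "locus_tag", "FTN_0209")}
annotation_graph = {(GENE, RDFS + "comment", "placeholder annotation text")}
# Hypothetical schema statement: locus_tag is declared a sub-property of rdfs:label.
schema_graph = {(IMG_S + "locus_tag", RDFS + "subPropertyOf", RDFS + "label")}


def merge(*graphs):
    """Combine graphs by set union; nodes sharing a URI become one node."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def infer_subproperties(graph):
    """One pass of the rule: (p subPropertyOf q) and (s p o) gives (s q o)."""
    parents = {(s, o) for s, p, o in graph if p == RDFS + "subPropertyOf"}
    derived = {(s, q, o) for s, p, o in graph for child, q in parents if p == child}
    return graph | derived


combined = infer_subproperties(merge(genome_graph, annotation_graph, schema_graph))
properties_of_gene = sorted(p for s, p, o in combined if s == GENE)
print(properties_of_gene)
```
]]></preformat>
      <p>Because both source graphs use the same URI for the gene, the union needs no explicit matching step, and the derived statement is added alongside the asserted ones. A full RDF-S reasoner would apply further rules and repeat until no new statements appear.</p>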
      <p>[Figure 1: RDF graph for one gene in the Fn genome (locus tag FTN_0209, IMG gene_oid 639752258). Labels recoverable from the figure: RDF:type, RDF:description (TPR), RDFS:comment, IMG_S:locus_tag, IMG_S:genomic_location_start, IMG_S:genomic_location_end.]</p>
      <p>
Genome data and annotations (see Table 1) for Fn have been combined in
RDF within the Sesame (version 2.1; footnote 1) framework [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] installed on an iMac
with a 2.33 GHz Intel Core 2 Duo processor and 2 GB of memory. A single native
(disk-based B-Tree indexes) repository with the default "SPOC, POSC" (Subject,
Predicate, Object, Context; Predicate, Object, Subject, Context) index configuration
was created. RDF triples were loaded using the Sesame console. Perl
scripts were written to parse each data source into RDF N-Triples format. The
repository contains 1,258,677 triples. For the proof of concept we wished to
determine how easily annotations of a particular experimental data set could be
extended using the combined annotations in an RDF graph. Experimental data
from a proteomic experiment studying the transcriptional regulator MglA [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
were also added to the repository in order to test our hypothesis. Additionally,
an RDF Schema (RDF-S) was created for the experimental data set and this
was also added to the repository. Our preliminary results and observations for
Fn are described below.
      </p>
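      <p>The parsing step can be sketched as follows. The scripts used in this work were written in Perl and are not reproduced here; this Python fragment only shows the general shape of turning one parsed record into N-Triples lines. The base URI for gene resources and the record layout are assumptions made for illustration, and a complete serialiser would also need to escape control characters in literals.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Sketch only: serialise one parsed record as N-Triples lines.
IMG_S = "http://img.jgi.doe.gov/schema#"
# Hypothetical base URI for gene resources (assumption for illustration).
GENE_BASE = "http://img.jgi.doe.gov/gene/"


def literal(value):
    """Quote a plain literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def record_to_ntriples(record):
    """Turn a dict with a gene_oid key into one N-Triples line per other field."""
    subject = "<" + GENE_BASE + record["gene_oid"] + ">"
    lines = []
    for field in sorted(record):
        if field == "gene_oid":
            continue
        predicate = "<" + IMG_S + field + ">"
        lines.append(subject + " " + predicate + " " + literal(record[field]) + " .")
    return lines


example = {"gene_oid": "639752258", "locus_tag": "FTN_0209"}
ntriples = record_to_ntriples(example)
print("\n".join(ntriples))
```
]]></preformat>
      <p>Each output line is a complete statement, so files produced from different sources can be concatenated or loaded one after another without any schema alignment.</p>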
      <sec id="sec-4-1">
        <title>Table 1</title>
        <table-wrap>
          <table>
            <tbody>
              <tr>
                <td>PSN.V1</td>
                <td>Membranes.n3/Soluble.n3/Wholecell.n3</td>
                <td>748,157</td>
                <td>3,430</td>
                <td>University of Washington MglA protein abundance data sets from biological samples Membranes, Soluble and Whole cell</td>
                <td>https://wwamirce.gs.washington.edu/cgi-bin/fnu112/poson.cgi?poson=PSN081056</td>
              </tr>
              <tr>
                <td>PSN.V1</td>
                <td>cogNumberURL.n3</td>
                <td>2,548</td>
                <td>98.9</td>
                <td>University of Washington MglA annotation to COG database</td>
                <td>https://wwamirce.gs.washington.edu/cgi-bin/fnu112/poson.cgi?poson=PSN035866</td>
              </tr>
              <tr>
                <td>PSN.V3</td>
                <td>FnU112Version3.n3</td>
                <td>56,754</td>
                <td>417.6</td>
                <td>Fn genome data from University of Washington</td>
                <td>https://wwamirce.gs.washington.edu/cgi-bin/fnu112/poson.cgi?poson=PSN0088754.3</td>
              </tr>
              <tr>
                <td>DDB ID</td>
                <td>interact-prot-peptides.n3</td>
                <td>248,647</td>
                <td />
                <td>Fn peptide data from University of Washington</td>
                <td>http://regis-web.systemsbiology.net/protXML/protein group/protein/peptide/id/ddb000010839p39</td>
              </tr>
              <tr>
                <td>DDB ID</td>
                <td>interact-prot.n3</td>
                <td>20,682</td>
                <td>147.5</td>
                <td>Fn protein identification data from University of Washington</td>
                <td>http://regis-web.systemsbiology.net/protXML/protein group/protein/peptide/id/ddb000010839</td>
              </tr>
              <tr>
                <td>DDB ID</td>
                <td>mgla search db.fasta.blastp4 ypURL.n3</td>
                <td>1,719</td>
                <td>9.7</td>
                <td>DDB/PSN mapping from BLAST comparison</td>
                <td>http://regis-web.systemsbiology.net/protXML/protein group/protein/protein name/ddb000147854</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Footnote 1: www.openrdf.org/</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p><bold>Fn data sources are easily combined into an RDF Graph using Resource
Identifiers.</bold>
The combined RDF graph of Fn data sources can be used as a source for database
cross references. The different resources and identifiers used in the RDF sources
are shown in Table 1. A graph showing how these identifiers reconcile is shown
in Figure 2. The Fn genome and annotation data sources were added into the
repository first, and the FTN IDs, IMG (http://img.jgi.doe.gov/) Gene IDs and
NCBI (http://ncbi.nlm.nih.gov) Protein IDs were connected through the
CONSTRUCT statement shown in Table 2.</p>
      <p>[Figure 2: graph showing how the identifiers used in the Fn data sources reconcile. Node labels: (WashU-B) PSN.V1, (WashU-B) PSN.V2, (WashU-B) PSN.V3, (WashU-P) DDB, (COGs) COGID, (NCBI) PROTEINID, (IMG) GENEID, (Fn ORF ID) FTN, (Refseq) ACNo, (Gene Ontology) GOID, (ENZYME) E.C.No, (Uniprot) ACNo.]</p>
      <p>Further data sources were subsequently added and connected to the graph
(Gene Ontology data, Fn KEGG data and annotations using COGs derived
at the University of Washington). This was done to test our hypothesis that
a connected graph of identifiers could increase the depth of annotation
available to the experimental data, which included three data sets from the
University of Washington (UW). These data sets used a variety of identifiers. The
UW genome data (file FnU112Version3) used internal identifiers called POSON
numbers (PSN). There were several versions of these identifiers used internally,
and the experimental data generated using the MglA mutant strain (files
Membranes.n3/Soluble.n3/Wholecell.n3) used a different version of these identifiers.</p>
      <sec id="sec-5-1">
        <title>Table 2. CONSTRUCT statement</title>
        <preformat>CONSTRUCT {proteinid} nwrce:hasGeneID {geneid}
FROM {proteinid} G:locus_tag {ftn}, {geneid} G:locus_tag {ftn}
WHERE proteinid LIKE "http://www.ncbi.nlm.nih.gov*"
AND geneid LIKE "http://img.jgi.doe.gov*"
USING NAMESPACE G = &lt;http://img.jgi.doe.gov/schema#&gt;,
nwrce = &lt;https://wwamirce.gs.washington.edu/fnu112/schema#&gt;</preformat>
      </sec>
      <sec id="sec-5-2">
        <p>Data that mapped across the POSON versions were added to the graph, which
enabled the internal genome data and the Fn data graphs to connect. A third
data set, from a separate laboratory at the University of Washington, used a third
identifier, DDB. These data were mapped to the existing identifiers through the
addition of data from a BLAST [37] search against the genome data with
sequence identity set to 100%.</p>
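        <p>The effect of the CONSTRUCT statement in Table 2 can be sketched in Python as a join on the shared locus tag. This is a rough illustration, not the way the Sesame query engine evaluates the statement. The predicate URIs and the two URI prefixes follow the statement itself; the concrete protein and gene URIs below are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Sketch only: the effect of the CONSTRUCT statement, written as a join on locus tag.
LOCUS_TAG = "http://img.jgi.doe.gov/schema#locus_tag"
HAS_GENE_ID = "https://wwamirce.gs.washington.edu/fnu112/schema#hasGeneID"
NCBI_PREFIX = "http://www.ncbi.nlm.nih.gov"
IMG_PREFIX = "http://img.jgi.doe.gov"


def construct_has_gene_id(triples):
    """Link every NCBI protein to every IMG gene that shares its locus tag."""
    proteins, genes = {}, {}
    for subject, predicate, locus in triples:
        if predicate != LOCUS_TAG:
            continue
        if subject.startswith(NCBI_PREFIX):
            proteins.setdefault(locus, set()).add(subject)
        elif subject.startswith(IMG_PREFIX):
            genes.setdefault(locus, set()).add(subject)
    return {
        (protein, HAS_GENE_ID, gene)
        for locus, protein_set in proteins.items()
        for protein in protein_set
        for gene in genes.get(locus, set())
    }


# Hypothetical resource URIs; only the prefixes and FTN_0209 come from the text.
PROTEIN = "http://www.ncbi.nlm.nih.gov/protein/EXAMPLE_PROTEIN"
GENE = "http://img.jgi.doe.gov/gene/639752258"
source = {(PROTEIN, LOCUS_TAG, "FTN_0209"), (GENE, LOCUS_TAG, "FTN_0209")}
derived = construct_has_gene_id(source)
print(derived)
```
]]></preformat>
        <p>The derived statements can be added back to the repository, after which a protein identifier reaches its gene identifier in one step instead of through the locus tag.</p>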
        <p><bold>Data integration increases the depth of bioinformatics annotation and reduces
the effort required to manually annotate data in individual data sources.</bold>
The depth of annotation available to the experimental data sets was increased
through data integration based on database cross references. The RDF graph of
the experimental data can be queried via the interposed layer of GO, KEGG and
Superfamily descriptions, even though these data were not manually matched to
these databases or provided with explicit annotations. These annotations are
available by integrating data sets that have been manually annotated previously
to at least one data source in the RDF graph. This form of data integration
increases the amount of information available to biologists, who no longer have to
manually create each individual database cross reference. Sample SeRQL queries
that show how the MglA experimental data are linked to annotations are shown
below for KEGG and SUPERFAMILY data sets.</p>
        <p><bold>Querying MglA data through KEGG.</bold>
The query in Table 3 gives the PSN identifiers and their E.C. numbers from the
KEGG database for PSNs whose abundance in the MglA experiment was above
2000. The MglA data were not annotated using KEGG data. These links are
available through the identifier cross references established in the RDF graph.</p>
        <p><bold>Querying MglA data through Superfamily.</bold>
The query in Table 4 shows how the MglA data are linked to the SUPERFAMILY
database in the RDF graph. PSN identifiers used in the MglA data are connected
to FTN identifiers. Superfamily annotations are linked via PID identifiers which
are connected to the FTN identifiers.</p>
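        <p>Table 3. SeRQL query giving PSN identifiers and their E.C. numbers for PSNs with abundance above 2000 in the MglA experiment.</p>
        <preformat>SELECT psn, ec
FROM
{ftn} rdfs:seeAlso {ec},
{psn} rdfs:seeAlso {ftn},
{analysis} wu:poson {psn},
{analysis} mgla:experiment {exp},
{exp} mgla:abundance {abundance}
WHERE abundance &gt; 2000
USING NAMESPACE
mgla = &lt;https://wwamirce.gs.washington.edu/fnu112/experiments/mgla/schema#&gt;,
wu = &lt;https://wwamirce.gs.washington.edu/fnu112/schema#&gt;</preformat>
        <p>[Diagram of the query path for Table 3; recoverable labels: psn, exp, ftn, ec.]</p>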
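        <p>The path followed by the query in Table 3 can be sketched in Python as a walk over a list of triples. This is only an illustration of the traversal, not a SeRQL evaluator: results are collected in a set, so duplicate rows collapse. The prefixed predicate names mirror those in the query, while every identifier and abundance value in the sample data is a made-up placeholder.</p>
        <preformat preformat-type="code"><![CDATA[
```python
# Sketch only: the Table 3 query expressed as a walk over a list of triples.
SEE_ALSO = "rdfs:seeAlso"
POSON = "wu:poson"
EXPERIMENT = "mgla:experiment"
ABUNDANCE = "mgla:abundance"


def objects(triples, subject, predicate):
    """Return every object attached to subject through predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def psn_with_ec(triples, threshold):
    """Pairs (psn, ec) for analyses whose experiment abundance exceeds threshold."""
    results = set()
    for analysis, predicate, psn in triples:
        if predicate != POSON:
            continue
        abundances = [
            value
            for exp in objects(triples, analysis, EXPERIMENT)
            for value in objects(triples, exp, ABUNDANCE)
        ]
        if not any(value > threshold for value in abundances):
            continue
        # Follow psn -> ftn -> ec through the two seeAlso cross references.
        for ftn in objects(triples, psn, SEE_ALSO):
            for ec in objects(triples, ftn, SEE_ALSO):
                results.add((psn, ec))
    return results


# All identifiers and abundance values below are made-up placeholders.
TRIPLES = [
    ("analysis:1", POSON, "psn:A"),
    ("analysis:1", EXPERIMENT, "exp:1"),
    ("exp:1", ABUNDANCE, 3500.0),
    ("psn:A", SEE_ALSO, "ftn:A"),
    ("ftn:A", SEE_ALSO, "ec:A"),
    ("analysis:2", POSON, "psn:B"),
    ("analysis:2", EXPERIMENT, "exp:2"),
    ("exp:2", ABUNDANCE, 150.0),
    ("psn:B", SEE_ALSO, "ftn:B"),
    ("ftn:B", SEE_ALSO, "ec:B"),
]
hits = psn_with_ec(TRIPLES, 2000)
print(hits)
```
]]></preformat>
        <p>The experimental record never mentions an E.C. number; it is reached only because the PSN and FTN identifiers are cross-referenced elsewhere in the graph.</p>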
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Discussion and Further Work</title>
      <p>Table 4. SeRQL query linking the MglA data to SUPERFAMILY annotations.</p>
      <preformat>SELECT psn, pid, family
FROM
{psn} rdfs:seeAlso {ftn},
{pid} gen:locus_tag {ftn},
{pid} prot:Protein_Family {family},
{analysis} wu:poson {psn},
{analysis} mgla:experiment {exp},
{exp} mgla:abundance {abundance}
WHERE abundance &gt; 2000
AND family LIKE "http://supfam.org/SUPERFAMILY/cgi-bin/model.cgi?model=*"
USING NAMESPACE gen = &lt;http://img.jgi.doe.gov/schema#&gt;,
prot = &lt;http://purl.uniprot.org/core/&gt;,
mgla = &lt;https://wwamirce.gs.washington.edu/fnu112/experiments/mgla/schema#&gt;,
wu = &lt;https://wwamirce.gs.washington.edu/fnu112/schema#&gt;</preformat>
      <p>[Diagram of the query path for Table 4; recoverable labels: analysis, psn, exp, ftn, family, abundance, prot:Protein_Family, gen:locus_tag, rdfs:seeAlso, mgla:abundance.]</p>
      <p><bold>Unique Identifiers.</bold>
URIs (Uniform Resource Identifiers) are the base concept on which the semantic
web technologies were developed. All things on the semantic web are resources,
and all resources may be identified by URIs. The use of globally unique
identification (GUID) can greatly facilitate data integration [38]. For example, when
two data nodes are the same in two resources, those data can be reconciled very
easily if the nodes use GUIDs. In bioinformatics, the databases GenBank and
EMBL share a unique identifier called an accession number. A user can use
this identifier to retrieve the same sequence in either database. This also means
that this unique identifier can be used to reconcile these sequences if two
separate resources make reference to the same sequence. If the unique identifier is
used, we know that both resources are referring to the same sequence. Where
individual data sources use their own forms of unique identification, a URI can
make those identifiers unique, for example, www.protein.org/seq#123456 and
www.gene.org/seq#123456. The use of URIs for unique identification can
resolve the issue of the same identifier used in different databases referring to
different things.</p>
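      <p>A short Python sketch of this point follows. It reuses the two example namespaces from the text, with an http:// scheme added as an assumption, and shows that the same local identifier yields two distinct nodes once it is qualified by its source. The labels attached to the nodes are placeholders.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Sketch only: qualifying local identifiers with a source namespace.
def to_uri(namespace, local_id):
    """Build a URI from a source namespace and that source's local identifier."""
    return namespace + "#" + local_id


protein_uri = to_uri("http://www.protein.org/seq", "123456")
gene_uri = to_uri("http://www.gene.org/seq", "123456")

# The same local identifier no longer collides once it is namespace-qualified.
graph = {
    (protein_uri, "rdfs:label", "a protein record"),
    (gene_uri, "rdfs:label", "a gene record"),
}
subjects = {s for s, p, o in graph}
print(len(subjects))
```
]]></preformat>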
      <p>The lack of persistent unique identification in the various Fn data sets was a
considerable problem, and some manual mediation was required to resolve and
combine the data sets. For Fn ORFs alone there were upwards of seven
different identifiers used for the same entity (see Figure 2). By combining data
in RDF, the different identifiers used in the Fn data can be reconciled, and the
RDF graph can be used as a source of cross references between the experimental
data and the annotation in public domain data sources. However, in the long
term, for semantic web approaches to be successful in biology, data producers
and users need supported tools that can produce and resolve persistent unique
identifiers.</p>
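      <p>One way to picture this reconciliation is to treat each cross reference as an edge and to group identifiers that are connected. The following Python sketch does this with a small union-find structure; it is not the procedure used in this work, and all identifiers shown are hypothetical placeholders.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def reconcile(xrefs):
    """Group identifiers connected by cross-reference pairs (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in xrefs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for ident in list(parent):
        groups[find(ident)].add(ident)
    return list(groups.values())


# Hypothetical cross references between identifier schemes for two ORFs.
xrefs = [
    ("locus:FTN_0001", "gi:0000001"),
    ("gi:0000001", "ipi:X1"),
    ("locus:FTN_0002", "gi:0000002"),
]
groups = reconcile(xrefs)
```
]]></preformat>
      <p>Each resulting group collects the identifiers that refer to one entity, so any member can be used to look up annotations attached to the others.</p>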
      <p><bold>XML data exchange formats.</bold>
The bioinformatics community has invested heavily in XML data exchange formats.
There are numerous examples. MIAME [39] is a standard for reporting
microarray experiments. The Proteomics Standards Initiative (www.psi.org) has
developed MIAPE for proteomics mass spectrometry data, along with other standard
exchange formats for chromatography and gel electrophoresis. Data interchanged
in standard formats like these can be readily transformed into RDF. These
formats can also be used as the predicate vocabulary. Wherever a suitable
standard term existed, we aimed to use it. Currently, standard,
easily accessible vocabularies are lacking, largely because
the omics XML standards were built as data exchange formats and using them
as vocabularies is out of their scope. Our experience highlights that
further coordination and extension
of vocabularies and checklists is required.</p>
      <p>We created simple XSLT scripts to convert data from these standard formats
into RDF. Conversion scripts for common data formats such as FASTA and
GenBank<sup>2</sup> were written in Perl. These scripts are far easier to develop
and more readily reusable than traditional data warehouse ETL processes,
and this mechanism of data interchange is more accessible to biologists.</p>
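      <p>To illustrate how small such a converter can be, the sketch below turns FASTA records into N-Triples lines. It is written in Python rather than the Perl used in this work, and the base URI, the vocabulary URI and the predicate names (description, sequence) are assumptions chosen for the example, not the vocabulary used in the Fn repository.</p>
      <preformat><![CDATA[
```python
from urllib.parse import quote


def split_header(header):
    """Split a FASTA header into (identifier, description)."""
    ident, _, description = header.partition(" ")
    return ident, description


def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield (*split_header(header), "".join(chunks))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield (*split_header(header), "".join(chunks))


def literal(value):
    """Quote a string as an N-Triples literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def fasta_to_ntriples(text,
                      base="http://example.org/seq/",      # hypothetical
                      vocab="http://example.org/schema#"):  # hypothetical
    """Return one N-Triples line per description and per sequence."""
    lines = []
    for ident, description, sequence in parse_fasta(text):
        subject = "<" + base + quote(ident, safe="") + ">"
        if description:
            lines.append(f"{subject} <{vocab}description> {literal(description)} .")
        lines.append(f"{subject} <{vocab}sequence> {literal(sequence)} .")
    return lines


# Hypothetical two-record FASTA input.
example = """>FTN_0001 hypothetical protein
MKTAYIAK
QRQISFVK
>FTN_0002
MSDNE
"""
triples = fasta_to_ntriples(example)
```
]]></preformat>
      <p>The identifier is percent-encoded before it is placed in the subject URI, since FASTA identifiers often contain characters such as the vertical bar that are not allowed in URIs.</p>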
      <p>Standard vocabulary terms can also facilitate data integration. For example,
just as two nodes that share the same URI are resolved, nodes in different graphs
may be linked together by shared predicates. We required, and ultimately created,
a vocabulary in RDF-S that describes the experimental design of the MglA
mutant experiment, in order to integrate the peptide abundance data with the
standard protein identification data that was in the ProtXML format. Although
this paper focuses on integration at the level of resource identifiers, further
integration can be achieved by combining MglA data and the protein identifications
at the level of properties used in both RDF graphs.</p>
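      <p>Property-level integration of this kind could, for instance, be expressed with rdfs:subPropertyOf statements that map predicates from different graphs onto a shared one. The Python sketch below applies that entailment to plain triples, following subPropertyOf chains transitively. Apart from mgla:abundance, every name in it (array:expression, ex:measuredLevel, ex:FTN_0001) and every value is hypothetical, and it is an illustration rather than the mapping developed in this work.</p>
      <preformat><![CDATA[
```python
SUBPROP = "rdfs:subPropertyOf"


def apply_subproperty(triples):
    """Return the triples plus those entailed by rdfs:subPropertyOf."""
    # Map each property to its directly declared super-properties.
    supers = {}
    for s, p, o in triples:
        if p == SUBPROP:
            supers.setdefault(s, set()).add(o)

    def ancestors(prop):
        """All super-properties reachable from prop."""
        seen, stack = set(), [prop]
        while stack:
            for parent in supers.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    inferred = set(triples)
    for s, p, o in triples:
        if p != SUBPROP:
            for q in ancestors(p):
                inferred.add((s, q, o))
    return inferred


# Two graphs describing the same (hypothetical) locus with different predicates.
peptide_graph = {("ex:FTN_0001", "mgla:abundance", 2500)}
transcript_graph = {("ex:FTN_0001", "array:expression", 1.8)}

# Hypothetical RDF-S mapping onto a shared predicate.
mapping = {
    ("mgla:abundance", SUBPROP, "ex:measuredLevel"),
    ("array:expression", SUBPROP, "ex:measuredLevel"),
}

merged = apply_subproperty(peptide_graph | transcript_graph | mapping)
levels = sorted(o for s, p, o in merged if p == "ex:measuredLevel")
```
]]></preformat>
      <p>After the mapping is applied, a single query on the shared predicate returns values from both graphs, while the original, more specific statements remain available.</p>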
      <sec id="sec-6-1">
        <title>2 http://www.ncbi.nlm.nih.gov</title>
        <p><bold>Annotation of data analysis results.</bold>
We found that experimental procedures and raw data are easily accessible in
standard representations. However, analysed data, such as those found in
secondary databases and published in papers, are generally only available in ad
hoc formats and on journal web pages. While progress has been made in the
standardisation of experimental data, the analysis process and the analysed data still
require an exchange standard. This task might be handled partly by workflow
descriptions and by standard vocabularies.</p>
        <p><bold>Ontologies.</bold>
An ontology can capture the terms and the rules normally associated with
human interpretation in a computationally amenable form. Two domains of data
can therefore be described by an ontology, and this allows the data within those
domains to be queried together to enable data discovery. Currently, there are
many tools for developing ontologies, and as these mature and become more user
friendly, more ontologies will be published and used for biological data
integration. A few established ontologies are the Gene Ontology [40], Functional Genomics
Ontology [41], Ontology for Biomedical Investigations [42], Influenza Infectious
Disease Ontology [43] and Mammalian Phenotype Ontology [44]. So far, ontologies
are being developed within local groups for specific purposes and there are still
only a few community based efforts. However, the distributed nature of
ontology development is not expected to have any serious effect, since there
is an understanding that ontologies can be merged and used together. Also,
there are now many ontology repositories, which are increasing the accessibility
of the growing collection of online ontologies. For example, Ontoselect [45]
collects online ontologies, SchemaWeb [46] is a resource to which users can submit
ontologies, and the NCBO BioPortal [47] is an ontology repository providing
uniform access to online ontologies within the biology domain. BioPortal
provides a valuable resource with very intuitive search and browse functionality and
visualisations.</p>
        <p>The two ontologies that are relevant to our experimental data sets
are PROTON [48] and the MGED Ontology [49]. PROTON models
concepts, methods, algorithms, tools and databases relevant to the proteomics
domain. The MGED Ontology provides terms for concepts used within microarray
experiments. These ontologies were loaded into the RDF repository. However,
disappointingly, very little progress has been made so far in combining proteomics
and transcriptomics data with these two ontologies. These ontologies are heavily
loaded with concepts specific to their domain (and experimental details) and do
not relate the elements of integration, which are the abundance of mRNA and the
abundance of peptides extracted from the organism. They do not describe the
relationship between transcripts and peptides. Further work is required to make
best use of the available ontologies to integrate data that share no common data
elements but are related.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>This paper demonstrates the progress made while testing semantic web
technologies for data integration and highlights gaps and further requirements in
data integration support. We demonstrated that data integration using RDF
is easy to carry out and that simple integration at the level of resource
identifiers can be achieved cheaply and efficiently. The combined data in the RDF
graph provides a resource for database cross references for Fn data. An RDF
dump of the Sesame repository (in N-Triples format) can be downloaded from:
http://spira.bio.gla.ac.uk/Francisella/swat4ls.nt.</p>
      <p>This resource increases the depth of annotation available to biologists, and this
form of integration reduces the manual effort that would normally be required to
gain this depth of annotation. Further work will include extending the integration
semantically using RDF-S that maps between predicates used in different graphs.
This will be tested in the first instance on peptide and transcript abundance data.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This work is funded by the BBSRC RASOR grant (BBC5115721). Data from
the University of Washington was provided by Professor Dave Goodlett and Dr
Mitch Brittnacher.</p>
      <p>33. Villanueva-Rosales, N. and Dumontier, M.: yOWL: An ontology-driven knowledge base for yeast biologists. Journal of Biomedical Informatics 41:5 (2008) 779–789</p>
      <p>34. Lam, H.Y.K. et al.: AlzPharm: integration of neurodegeneration data using RDF. BMC Bioinformatics 8:3 (2007) S4</p>
      <p>35. Cheung, K.H. and Yip, K.Y. and Smith, A. and Deknikker, R. and Masiar, A. and Gerstein, M.: YeastHub: a semantic web use case for integrating data in the life sciences domain. Bioinformatics 21:1 (2005) i85–i96</p>
      <p>36. Decker, S. and Mitra, P. and Melnik, S.: Framework for the Semantic Web: An RDF Tutorial. IEEE Internet Computing (2000) 68–73</p>
      <p>37. Altschul, S.F. and Madden, T.L. and Schäffer, A.A. and Zhang, J. and Zhang, Z. and Miller, W. and Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25:17 (1997) 3390–3402</p>
      <p>38. Clark, T. and Martin, S. and Liefeld, T.: Globally distributed object identification for biological knowledgebases. Briefings in Bioinformatics 5:1 (2004) 59</p>
      <p>39. Brazma, A. et al.: Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nature Genetics 29 (2001) 365–372</p>
      <p>40. Harris, M.A. et al.: The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32:1 (2004) D258–61</p>
      <p>41. Whetzel, P.L. et al.: Development of FuGO: An Ontology for Functional Genomics Investigations. OMICS: A Journal of Integrative Biology 10:2 (2006) 199–204</p>
      <p>42. Smith, B. et al.: The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology 25:11 (2007) 1251–1255</p>
      <p>43. Lindsay Cowell and Barry Smith: Infectious Disease Ontology (IDO). www.infectiousdiseaseontology.org/</p>
      <p>44. Smith, C.L. and Goldsmith, C.A.W. and Eppig, J.T.: The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biology 6:1 (2005)</p>
      <p>45. Ontoselect http://olp.dfki.de/ontoselect/</p>
      <p>46. SchemaWeb http://www.schemaweb.info/</p>
      <p>47. NCBO BioPortal http://bioportal.bioontology.org/</p>
      <p>48. PROTON (PROTo ONtology) Home Page http://proton.semanticweb.org</p>
      <p>49. MGED - Microarray and Gene Expression Data Home http://mged.sourceforge.net</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Hack</surname>
            <given-names>C. J.:</given-names>
          </string-name>
          <article-title>Integrated transcriptome and proteome data: the challenges ahead</article-title>
          .
          <source>Briefings in Functional Genomics and Proteomics</source>
          <volume>3</volume>
          :
          <issue>3</issue>
          (
          <year>2004</year>
          )
          <fpage>212</fpage>
          –
          <lpage>219</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Lassila</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Swick</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          :
          <article-title>Resource Description Framework (RDF) Model and Syntax Specification</article-title>
          .
          <source>World Wide Web Consortium W3C</source>
          (
          <year>1999</year>
          ) http://citeseer.ist.psu.edu/212974.html
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Joyce</surname>
            <given-names>AR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palsson</surname>
            <given-names>B</given-names>
          </string-name>
          :
          <article-title>The model organism as a system: integrating 'omics' data sets</article-title>
          .
          <source>Nature Reviews Molecular Cell Biology</source>
          <volume>7</volume>
          :
          <issue>3</issue>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ge</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Walhout</surname>
            ,
            <given-names>A.J.M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Vidal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Integrating 'omic' information: a bridge between genomics and systems biology</article-title>
          .
          <source>Trends in Genetics</source>
          <volume>19</volume>
          :
          <issue>10</issue>
          (
          <year>2003</year>
          )
          <fpage>551</fpage>
          –
          <lpage>560</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Conway</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          et al.:
          <article-title>Microarray expression profiling: capturing a genome-wide portrait of the transcriptome</article-title>
          .
          <source>Molecular Microbiology</source>
          <volume>47</volume>
          :
          <issue>4</issue>
          (
          <year>2003</year>
          )
          <fpage>879</fpage>
          –
          <lpage>889</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Tyers</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>From genomics to proteomics</article-title>
          .
          <source>Nature</source>
          <volume>422</volume>
          :
          <issue>6928</issue>
          (
          <year>2003</year>
          )
          <fpage>193</fpage>
          –
          <lpage>197</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Patterson</surname>
            ,
            <given-names>S.D.</given-names>
          </string-name>
          :
          <article-title>Data analysis-the Achilles heel of proteomics</article-title>
          .
          <source>Nature Biotechnology</source>
          <volume>21</volume>
          :
          <issue>3</issue>
          (
          <year>2003</year>
          )
          <fpage>221</fpage>
          –
          <lpage>222</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Yates</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>Mass spectrometry from genomics to proteomics</article-title>
          .
          <source>Trends in Genetics</source>
          <volume>16</volume>
          :
          <issue>1</issue>
          (
          <year>2000</year>
          )
          <fpage>5</fpage>
          –
          <lpage>8</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Schadt</surname>
          </string-name>
          , E.E. et al.:
          <article-title>Genetics of gene expression surveyed in maize, mouse and man</article-title>
          .
          <source>Nature</source>
          <volume>422</volume>
          :
          <issue>6929</issue>
          (
          <year>2003</year>
          )
          <fpage>297</fpage>
          –
          <lpage>302</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Schadt</surname>
            ,
            <given-names>E.E.</given-names>
          </string-name>
          et al.:
          <article-title>An integrative genomics approach to infer causal associations between gene expression and disease</article-title>
          .
          <source>Nature Genetics</source>
          <volume>37</volume>
          (
          <year>2005</year>
          )
          <fpage>710</fpage>
          –
          <lpage>717</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Karp</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          et al.:
          <article-title>Identification of complement factor 5 as a susceptibility locus for experimental allergic asthma</article-title>
          .
          <source>Nature Immunology</source>
          <volume>1</volume>
          (
          <year>2000</year>
          )
          <fpage>221</fpage>
          –
          <lpage>226</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Barker</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Klose</surname>
          </string-name>
          , K.E:
          <article-title>Molecular and Genetic Basis of Pathogenesis in Francisella Tularensis</article-title>
          .
          <source>Annals of the New York Academy of Sciences</source>
          <volume>1105</volume>
          (
          <year>2007</year>
          )
          <fpage>138</fpage>
          –
          <lpage>159</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Rohmer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          et al.:
          <article-title>Comparison of Francisella tularensis genomes reveals evolutionary events associated with the emergence of human pathogenic strains</article-title>
          .
          <source>Genome Biology</source>
          <volume>8</volume>
          :
          <issue>6</issue>
          (
          <year>2007</year>
          ) R102
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Nano</surname>
            ,
            <given-names>F.E.</given-names>
          </string-name>
          et al.:
          <article-title>A Francisella tularensis Pathogenicity Island Required for Intramacrophage Growth</article-title>
          .
          <source>Journal of Bacteriology</source>
          <volume>186</volume>
          :
          <issue>19</issue>
          (
          <year>2004</year>
          )
          <fpage>6430</fpage>
          –
          <lpage>6436</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Broekstra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kampman</surname>
          </string-name>
          , A. and
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema</article-title>
          .
          <source>Proceedings of the First International Semantic Web Conference</source>
          (ISWC
          <year>2002</year>
          )
          <volume>2342</volume>
          <fpage>54</fpage>
          –
          <lpage>68</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Guina</surname>
          </string-name>
          , T. et al.:
          <article-title>MglA Regulates Francisella tularensis subsp. novicida Response to Starvation and Oxidative Stress</article-title>
          .
          <source>Journal of Bacteriology</source>
          <volume>189</volume>
          :
          <issue>18</issue>
          (
          <year>2007</year>
          )
          <fpage>6580</fpage>
          –
          <lpage>6586</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Brotcke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          et al.:
          <article-title>Identification of MglA-Regulated Genes Reveals Novel Virulence Factors in Francisella tularensis</article-title>
          .
          <source>Infection and Immunity</source>
          <volume>74</volume>
          :
          <issue>12</issue>
          (
          <year>2006</year>
          )
          <fpage>6642</fpage>
          –
          <lpage>6655</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Lacroix</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Crichlow</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          : Bioinformatics, Managing Scientific Data. Morgan Kaufman (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Gorton</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Architectures and technologies for enterprise application integration</article-title>
          .
          <source>Proceedings of the 26th International Conference on Software Engineering</source>
          (ICSE
          <year>2004</year>
          )
          <fpage>726</fpage>
          –
          <lpage>727</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Lord</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          et al.:
          <article-title>Applying Semantic Web Services to Bioinformatics: Experiences Gained, Lessons Learnt</article-title>
          .
          <source>Lecture Notes in Computer Science</source>
          (
          <year>2004</year>
          )
          <fpage>350</fpage>
          –
          <lpage>364</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Curbera</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Duftler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Khalaf</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Nagy</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Mukhi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Weerawarana</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Unraveling the Web Services Web: An Introduction to SOAP, WSDL, and UDDI</article-title>
          .
          <source>IEEE Internet computing 6:22</source>
          (
          <year>2002</year>
          )
          <fpage>86</fpage>
          –
          <lpage>93</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Karp</surname>
          </string-name>
          , P.D.:
          <article-title>Database links are a foundation for interoperability</article-title>
          .
          <source>Trends in Biotechnology 14:8</source>
          (
          <year>1996</year>
          )
          <fpage>273</fpage>
          –
          <lpage>279</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Etzold</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ulyanov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Argos</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>SRS: information retrieval system for molecular biology data banks</article-title>
          .
          <source>Methods Enzymol</source>
          <volume>266</volume>
          (
          <year>1996</year>
          )
          <fpage>114</fpage>
          –
          <lpage>128</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Schuler</surname>
            ,
            <given-names>GD</given-names>
          </string-name>
          and Epstein, JA and Ohkawa, H. and
          <string-name>
            <surname>Kans</surname>
            ,
            <given-names>JA</given-names>
          </string-name>
          :
          <article-title>Entrez: molecular biology database and retrieval system</article-title>
          .
          <source>Methods Enzymol</source>
          <volume>266</volume>
          (
          <year>1996</year>
          )
          <fpage>141</fpage>
          –
          <lpage>162</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Stevens</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Wrapping and interoperating bioinformatics resources using CORBA</article-title>
          .
          <source>Briefings in Bioinformatics</source>
          <volume>1</volume>
          (
          <year>2000</year>
          )
          <fpage>9</fpage>
          –
          <lpage>21</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Davidson</surname>
            , SB and Overton,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Tannen</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Wong</surname>
          </string-name>
          , L.:
          <article-title>BioKleisli: a digital library for biomedical researchers</article-title>
          .
          <source>International Journal on Digital Libraries</source>
          <volume>1</volume>
          (
          <year>1997</year>
          )
          <fpage>36</fpage>
          –
          <lpage>53</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Haas</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Schwarz</surname>
            ,
            <given-names>P.M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kodali</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kotlar</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Rice</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Swope</surname>
          </string-name>
          , W.C.:
          <article-title>DiscoveryLink: A system for integrated access to life sciences data sources</article-title>
          .
          <source>IBM Systems Journal</source>
          <volume>40</volume>
          :
          <issue>2</issue>
          (
          <year>2001</year>
          )
          <fpage>489</fpage>
          –
          <lpage>511</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Sohrab</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Yong</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Macaire</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>John</surname>
          </string-name>
          , L. and
          <string-name>
            <surname>Francis</surname>
            ,
            <given-names>O.B.F.</given-names>
          </string-name>
          :
          <article-title>Atlas – a data warehouse for integrative bioinformatics</article-title>
          . http://www.biomedcentral.com/1471-2105/6/34, BMC Bioinformatics 6:
          <fpage>34</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Birkland</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Yona</surname>
          </string-name>
          , G.:
          <article-title>BIOZON: a system for unification, management and analysis of heterogeneous biological data</article-title>
          .
          <source>BMC Bioinformatics</source>
          <volume>7</volume>
          :1 (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Kasprzyk</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          et al.:
          <article-title>EnsMart: A Generic System for Fast and Flexible Access to Biological Data</article-title>
          .
          <source>Genome Research</source>
          <volume>14</volume>
          :
          <issue>1</issue>
          (
          <year>2004</year>
          )
          <fpage>160</fpage>
          –
          <lpage>169</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Pasquier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Biological data integration using Semantic Web technologies</article-title>
          .
          <source>Biochimie</source>
          <volume>90</volume>
          :
          <issue>4</issue>
          (
          <year>2008</year>
          )
          <fpage>584</fpage>
          –
          <lpage>594</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Cheung</surname>
            ,
            <given-names>K.H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Yip</surname>
          </string-name>
          , K.Y. and
          <string-name>
            <surname>Schultz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gerstein</surname>
          </string-name>
          , M.B.:
          <article-title>LinkHub: a Semantic Web system that facilitates cross-database queries and information retrieval in proteomics</article-title>
          .
          <source>BMC Bioinformatics 8: Suppl</source>
          <volume>3</volume>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>