<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Provenance of Microarray Experiments for a Better Understanding of Experiment Results</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Helena F. Deus</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jun Zhao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Satya Sahoo</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathias Samwald</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric Prud'hommeaux</string-name>
          <xref ref-type="aff" rid="aff7">7</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Miller</string-name>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>* M.Scott Marshall</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>* Kei-Hoi Cheung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Medical Informatics, Yale University School of Medicine</institution>
          ,
          <addr-line>New Haven</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Zoology, University of Oxford</institution>
          ,
          <addr-line>Oxford</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Bioinformatics and Computational Biology, The University of Texas M. D. Anderson Cancer Center</institution>
          ,
          <addr-line>Houston, USA</addr-line>
          ,
          <institution>Instituto de Tecnologia Química e Biológica, UNL</institution>
          ,
          <addr-line>Lisboa</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Department of Medical Statistics and Bioinformatics, Leiden University Medical Center</institution>
          ,
          <addr-line>Leiden</addr-line>
          ,
          <country country="NL">The Netherlands</country>
          ,
          <institution>Informatics Institute, University of Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Digital Enterprise Research Institute, National University of Ireland Galway</institution>
          ,
          <addr-line>Galway</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Kno.e.sis Center, Department of Computer Science and Engineering, Wright State University</institution>
          ,
          <addr-line>Dayton</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>Tantric Designs</institution>
          ,
          <addr-line>Seattle</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff7">
          <label>7</label>
          <institution>World Wide Web Consortium, MIT, Cambridge</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes a Semantic Web (SW) model for gene lists and the metadata required for their practical interpretation. Our provenance information captures the context of experiments as well as the processing and analysis parameters involved in deriving the gene lists from DNA microarray experiments. We demonstrate a range of practical neuroscience queries which draw on the proposed model. Our provenance representation includes the origins of the gene list and basic information about the data set itself (e.g. last modification date and original data source), in order to facilitate the federation of gene lists with other types of Semantic Web-formatted data and the integration of a broader molecular context through additional omics data.</p>
      </abstract>
      <kwd-group>
        <kwd>data integration</kwd>
        <kwd>query federation</kwd>
        <kwd>semantic web</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>In the genomics/post-genomics era, massive amounts of
data generated by high throughput experiments, including
those using microarray technologies, have presented both
promises and challenges to clinical and translational research.
One goal of microarray experiments is to discover, out of tens
of thousands of genes, a small subset of genes (usually on the
order of hundreds) whose expression pattern is indicative of
some biological response to a given experimental condition.</p>
      <p>
        Many computational/statistical approaches have been
developed to detect such biologically significant gene lists.
According to [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the workflow of a microarray experiment is
divided into the following steps: i) experimental design, which
includes the type of biological questions the experiment is
designed to address, how the experiment is implemented (e.g.,
experiment and control), sample preparation, microarray
platform selection, hybridization process, and scanning; ii)
data extraction, which includes image quantification,
filtering, and normalization; and iii) data analysis and
modeling, which include approaches such as clustering,
t-tests, enrichment analysis and so on.
      </p>
      <p>*These authors contributed equally to this work. KC is supported in part by
NIH grant U24 NS051869. JZ is supported by EPSRC grant
EP/G049327/1. HFD is supported by the Portuguese FCT (Fundação para a
Ciência e Tecnologia) scholarship SFRH/BD/45963/2008. The work of MS
was funded by the Science Foundation Ireland under Grant No.
SFI/08/CE/I1380 (Lion-2) and by a postdoctoral fellowship from the Konrad
Lorenz Institute for Evolution and Cognition Research, Austria.</p>
      <p>The gene lists produced in step iii are usually reported as
part of the experimental results published in scientific papers,
and the steps involved in obtaining the gene lists are described
in the methods section. Sometimes, gene lists are made
electronically available (e.g., spreadsheets) through journal
web sites. However, to the best of our knowledge, there is no
standard format for uniformly representing and broadly
sharing such gene lists in a focused scientific context.</p>
      <p>
        We believe it would be useful to the community if such
gene lists were commonly represented in a standard SW
vocabulary and accessible to SW applications. This approach
makes it possible for researchers to work with the gene list
without requiring a post hoc significance analysis to re-derive
the list. If experimental factors are included with gene lists,
researchers can account for context without requiring
labor-intensive manual research into the experimental factors for
each microarray study. A standard representation can be used
both for gene lists reported in individual papers (note that
these published gene lists are not yet stored in most microarray
databases) and those computed from datasets collected from
multiple microarray experiments across different microarray
databases (e.g., GEO profiles [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Gene Expression Atlas
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]).
      </p>
      <p>
        Integrated analysis (meta-analysis) requires raw and
processed datasets from independent microarray experiments
to be selected, compared, combined, and correlated using a
variety of computational/statistical methods. This is, of course,
much easier with machine-readable provenance and
experimental context. To this end, MIAME [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] was proposed
by the Microarray Gene Expression Data (MGED
(http://www.mged.org)) community (now called “Functional
Genomics Data Society” or FGED) to describe the Minimum
Information About a Microarray Experiment (MIAME) that is
needed to enable the interpretation of the results of the
experiment unambiguously and potentially to reproduce the
experiment.
      </p>
    </sec>
    <sec id="sec-2">
      <p>
        MIAME represents a set of guidelines for
microarray databases and data management software. The
MAGE data model and MAGE-ML (a standard XML format
for serializing the MAGE model) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have been developed
based on the MIAME data content specifications. In addition,
MAGE-TAB [6] was proposed as a (more user-friendly)
alternative to MAGE-ML.
      </p>
      <p>
        Along with the development of these standards, a
significant number of microarray databases ranging from
individual labs (e.g., Nomad at the DeRisi lab
(http://ucsfnomad.sourceforge.net/)), institutions (e.g., SMD [7], YMD
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and RAD [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) to the scientific community (e.g., GEO [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
and ArrayExpress [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) have been created, making large
collections of microarray datasets accessible to the public.
There are also microarray databases that serve the needs of
specific biomedical domains (e.g., the NIH Neuroscience
Microarray Consortium (http://np2.ctrl.ucla.edu/np2/home.do)).
Major journal publishers have promoted sharing of
microarray data by requiring authors to submit their data to
public microarray repositories. Some journal publishers make
supplemental data available on their web sites.
      </p>
      <p>While many microarray databases are MIAME-compliant,
several challenges still remain for researchers wishing to
locate datasets relevant to their interest:
• There is no central repository for all microarray datasets,
and experiments/datasets are stored across multiple databases.
• Users must learn to use different search interfaces and
analytic facilities at each database.
• Many databases lack experimental context, annotation,
and provenance.
• There is a lack of use of standard vocabularies in many
microarray databases.
• The lists of differentially expressed genes discussed by
most articles associated with a microarray study are not
disclosed in any standard format, nor are they
programmatically accessible.</p>
      <p>
        The Semantic Web [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] has been actively explored in the
context of biomedicine. For example, the W3C Semantic Web
Health Care and Life Sciences Interest Group (HCLS IG)
(http://www.w3.org/2001/sw/hcls/) represents a major
community effort involving both academia and industry. The
HCLS IG and allied efforts provide a growing corpus of
biomedical datasets expressed in the Resource Description
Framework (RDF) and Web Ontology Language (OWL). Wang et
al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] have described how the transition from the eXtensible
Markup Language (XML) to RDF could potentially enhance
semantic representation and integration of omic data. In
addition to data, biomedical ontologies are made available to
the community through organizations such as NCBO
(http://www.bioontology.org/) and OBO Foundry
(http://www.obofoundry.org/).
      </p>
      <p>In this paper we explore using SW to represent microarray
experimental data and provenance information about the context
under which the data were generated, including the goal of the
experiment, experimental factors (such as the disease or the cell
region), and the statistical analysis process which leads to the
experiment results. We explore the role of provenance
information in helping biologists understand microarray
experiments in the context of other experiments as well as other
existing biomedical knowledge. To facilitate a quality-aware
federation of microarray experiment results, we also provide
provenance information about the gene list data published
using SW standards. As a pilot study, we take a bottom-up
approach focusing on the type of provenance information
required to meet our motivating use cases and creating a
representation model with the minimum set of terms to meet
these use cases. Although these terms are currently defined in
our own namespaces, they can largely be mapped to existing
provenance vocabularies, which are generically defined and
evolving, to achieve maximum interoperability, in the next
stage of our pilot study.</p>
      <sec id="sec-2-1">
        <title>II. MOTIVATION</title>
        <p>One motivation of microarray experiments is to identify
genes that are differentially expressed in biological samples
under different conditions (e.g., disease vs. control). The
samples may come from tissues extracted from different
organs or parts of the same organ (e.g., different brain
regions). In this case, we may be able to discover differentially
expressed genes in each organ/organ part and how disease
may affect each organ/organ part at the gene expression level.
A common outcome of experiments is a list of candidate genes
which may serve as diagnostic or therapeutic markers. These
gene lists, abundant in biomedical literature, are provided in
heterogeneous formats (e.g., Excel spreadsheets and printed
tables embedded in papers) that hinder the reuse of the results.
In order to reuse such gene lists in additional pathway or
molecular analysis, it is important that they are represented in
a standardized, distributable, and machine-readable format that
is amenable to semantic queries.</p>
        <p>After obtaining a representative list of differentially
expressed genes, scientists may need to study these
experiment results in a broader molecular context with
additional data. In the case of neurological disease studies
such as Alzheimer's Disease (AD), researchers may want to
combine gene expression data from multiple AD microarray
studies. For example, one characterization of AD is the
formation of intracellular neurofibrillary tangles that affect
neurons in brain regions involved in the memory function. It is
important to have meta-data such as the cell type(s), cell
histopathology, and brain region(s) for comparing/integrating
the results across different AD microarray experiments. It is
important also to consider the (raw) data source and the types
of analysis performed on the data to arrive at meaningful
interpretations. Finally, gene expression data may be
combined with other types of data including genomic
functions, pathways, and associated diseases to broaden the
spectrum of integrative data analysis.</p>
        <p>
          In our pilot study, we selected three microarray
experiments from different journals ([
          <xref ref-type="bibr" rid="ref13 ref14 ref15">13-15</xref>
          ]) to explore how
to represent gene list experiment results in a structured format
and what types of metadata can better enable the computer to
search for genes that may play a molecular role in the
pathogenesis of AD. All the gene lists from the selected
publications were derived from human brain samples that were
prepared for AD studies. We wanted to be able to answer a
variety of user questions regarding semantically related
experiments and their experimental results. For example:
• Q0: What microarray experiments analyze samples taken
from the Entorhinal cortex region of Alzheimer's patients?
• Q1: Was the same data normalization algorithm or
statistical software package used in both studies that
analyze gene expression in the entorhinal cortex region of
AD patients?
• Q2: What genes are overexpressed in the Entorhinal
cortex region in the context of Alzheimer's and what is
their expression fold change and associated p-value?
• Q3: Are there any genes that are expressed differently in
two different brain regions (such as in Hippocampus and
Entorhinal cortex)?
        </p>
        <p>The MIAME standard outlines the minimum set of
information that is needed for describing microarray
experiments in order to facilitate the reproduction of these
experiments and a uniform interpretation of experiment
results. Experiments that record and publish a
MIAME-compliant experimental protocol should contain sufficient
information to answer questions like Q0 and Q1. However,
because MIAME does not specify a format, and MAGE-ML
and MAGE-TAB do not specify a standard representation for
experiment results (such as the set of genes showing particular
expression patterns), there is no simple mechanism to find
semantically related experimental results based on the patterns
of differentially expressed genes.</p>
        <p>In order to answer questions Q2 and Q3, it is necessary to
model both experimental information (e.g., Entorhinal cortex)
and statistical data (e.g., the p-values associated with gene
expression values).</p>
        <p>Additionally, we want to extend the knowledge
about genes linked to AD so that scientists can build on
their gene expression data analysis results
to answer questions like the following:
• Q4: What other diseases may be associated with the same
genes found to be linked to AD?
• Q5: What drugs are known that affect the same
overexpressed gene products and what are their target
diseases?
• Q6: Select all the genes determined to be differentially
expressed in the Entorhinal cortex in experiments performed
by AD investigators at the Translational Genomics Research
Institute.</p>
        <p>For these types of questions, the microarray experiment
results need to be federated (Q4, Q5) or combined (Q6) with
other datasets describing the data itself. We show how the
structured representation of microarray experiment data and
associated provenance metadata will enable us to query across
different aspects of domain knowledge about these experiment
results using several other datasets in the HCLS KB. We also
show how we can provide additional provenance information
about different datasets to support some quality-aware
federation queries over distributed data sources.</p>
      </sec>
      <sec id="sec-2-2">
        <title>III. METHODS</title>
        <p>To address questions Q0-Q3 we need both a precise
representation of the gene lists reported in the three selected
publications and a representation of the provenance of these
gene lists, such as the methods and procedures involved in
their generation. As mentioned in Section I, several standards
exist for describing microarray experiment protocols;
however, none is comprehensive enough to fully capture the
complex process of reporting the results of a microarray
experiment. To answer questions Q4-Q5 we need to query
across the exemplar datasets, using provenance information of
different levels of granularity, from the basic information
about the context of each experiment to details about the
analysis processes generating the gene expression results.
Although a number of provenance vocabularies, such as the
Open Provenance Model (OPM, http://openprovenance.org/)
and Provenir (http://wiki.knoesis.org/index.php/Provenir_Ontology),
are available, we chose a bottom-up
approach in this pilot study. On the one hand, at the time of
writing, little was known about how to choose between
these existing vocabularies to best suit our purpose; on the
other hand, our pilot study aims to focus on capturing the
minimum information to answer our case study questions.
This approach has the added advantage of shielding our model
from having to keep pace with rapidly evolving ontologies
while still enabling mapping to upper level ontologies in the
future. For these reasons, our data model includes the
minimum set of terms necessary to describe the three
examples selected, and is made available under our own local
namespace:</p>
        <p>@prefix biordf: &lt;http://purl.org/net/biordfmicroarray/ns#&gt; .</p>
        <p>
          Compared with provenance vocabularies, many domain-specific
ontologies are much more established and stable, such
as NIF (http://www.neuinfo.org/), the Disease Ontology (DO,
http://do-wiki.nubic.northwestern.edu/index.php/Main_Page),
or the voiD vocabulary [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Therefore, we reuse terms from
these ontologies that are already widely used to annotate
(biological) datasets in our data model in order to enable
maximum interoperability with other approaches.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>A. The Data Model</title>
      <p>Our data model captures the minimum information
necessary to describe the gene lists and the microarray
experiment context in which they were generated. To answer
each of the individual case study questions, different aspects
of each dataset had to be considered. For example, to answer
questions like Q0 and Q3 a good overview of each microarray
experiment is necessary, including the samples used, the
disease of interest, microarray platform, etc. For questions like
Q1 and Q2, however, a different set of assertions concerned
specifically with comparing gene expression quantification
methods in different settings is required. Finally, answering
questions like Q4 and Q5 involves the more complex
task of performing simultaneous queries on more than
one data source. As such, information describing the metadata
associated with each data source is also necessary. To
accommodate these different data types in our model, we have
defined four provenance levels, with each level entailing
different subsets of information:</p>
      <p>Institutional level: Includes assertions about the laboratory
where the experiments were performed and the reference
where the results were published to help determine the
trustworthiness of the data. This information is useful to
constrain the list of significant genes to only those that are
published in peer-reviewed articles and/or come from experiments performed at
certain institutions that have a track record of generating
high quality microarray data published in respected journals.</p>
      <p>Experiment protocol level: Includes assertions about the
brain regions from which the samples were gathered and the
histology of the cells. Such information has been partially
mapped to MGED, DO and NIF terms.</p>
      <p>Data analysis and significance level: Includes assertions
about the statistical analysis methodology for selecting the
relevant genes. Terms defined for this level are also provided
as a separate statistics module (http://purl.org/net/biordfmicroarray/stat#)
to describe software tools and
statistical terms.</p>
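      <p>As an illustration only, the significance-level assertions for one reported gene might be stated as in the following SPARQL Update sketch. The property names are those used in query Q4 (Section B); the ex: namespace, the resource names, the gene label, and the numeric values are hypothetical placeholders and are not data from the selected publications.</p>
      <preformat>PREFIX rdf:    &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX rdfs:   &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX xsd:    &lt;http://www.w3.org/2001/XMLSchema#&gt;
PREFIX biordf: &lt;http://purl.org/net/biordfmicroarray/ns#&gt;
PREFIX stat:   &lt;http://purl.org/net/biordfmicroarray/stat#&gt;
PREFIX ex:     &lt;http://example.org/genelist/&gt;   # hypothetical namespace

INSERT DATA {
  # One gene-list entry: the gene and its reported fold change
  ex:entry1 biordf:differentially_expressed_gene ex:geneA ;
            biordf:has_ouput_value ex:foldChange1 .

  # Human-readable identifiers for the gene (placeholder values)
  ex:geneA rdfs:label "GENE_A" ;
           biordf:name "hypothetical gene A" .

  # Fold change and the p-value attached to it (placeholder values)
  ex:foldChange1 rdf:value "1.8"^^xsd:float ;
                 stat:p_value "0.0004"^^xsd:float .
}</preformat>
      <p>Attaching the p-value to the fold-change resource, rather than to the gene, keeps the statistic tied to one comparison, so the same gene can carry different values in different experiments.</p>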
      <p>
        Dataset description level: Includes assertions about when the
dataset was published, which version of a source
dataset it is based on, and who published the dataset. Existing
vocabularies for describing RDF datasets on the Web were
reused to make the datasets more trustworthy. One is the Vocabulary
of Interlinked Datasets (voiD) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], which provides basic
information about who published the data as well as a summary
of the content of the dataset, such as the number of genes
described by the dataset or the SPARQL endpoint through
which the dataset can be accessed. The Provenance Vocabulary
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] was also used to provide a richer set of provenance
information, such as when the dataset was published, using which
tool, or by accessing which data server.
      </p>
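      <p>A minimal sketch of a dataset description at this level follows. It uses only the voiD and Dublin Core properties that query Q4 (Section B) relies on. The ex: resource, the endpoint URL, the issue date, and the diseasome: namespace IRI are hypothetical placeholders that we assume here for illustration.</p>
      <preformat>PREFIX rdf:  &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX xsd:  &lt;http://www.w3.org/2001/XMLSchema#&gt;
PREFIX dct:  &lt;http://purl.org/dc/terms/&gt;
PREFIX void: &lt;http://rdfs.org/ns/void#&gt;
PREFIX ex:        &lt;http://example.org/datasets/&gt;    # hypothetical namespace
PREFIX diseasome: &lt;http://example.org/diseasome/&gt;   # placeholder IRI, not the real one

INSERT DATA {
  # Describe one RDF rendering of a gene-disease dataset
  ex:diseasomeRendering rdf:type void:Dataset ;
      void:sparqlEndpoint &lt;http://example.org/diseasome/sparql&gt; ;
      dct:issued "2010-01-01"^^xsd:date ;
      dct:subject diseasome:diseases , diseasome:genes .
}</preformat>
      <p>With several such descriptions loaded, a query can compare their dct:issued values to pick the most recent rendering before it contacts any endpoint.</p>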
    </sec>
    <sec id="sec-4">
      <title>B. Formulation of SPARQL queries</title>
      <p>The queries described here are formulated at our demo site
(http://purl.org/net/biordfmicroarray/demo), where they can be
directly executed or copied and performed locally using
software such as SWObjects (https://sourceforge.net/projects/swobjects/files/).
The demo site also includes a
diagram explaining the four provenance levels and the types of
data entailed in each level.</p>
      <p>To answer Q0, experiments performed in samples collected
from patients with Alzheimer’s disease in a specific area of the
brain, the Entorhinal cortex, must be selected from the RDF
representation. The data necessary to answer this question is
completely entailed in the experimental provenance level and
can be formulated in terms of the entities used to represent
each step of the workflow involved in collecting a Sample.
Making use of data from the statistical analysis provenance
level, the same query Q0 can be amended to filter the list of
experiments retrieved based on the statistical normalization
software thus enabling an answer to Q1. To answer questions
Q2 and Q3 data pertaining to the experiment provenance level
must also be combined with information about the gene lists,
such as the expression level for each gene. A common
requirement to measure statistical significance of differentially
expressed genes is the p-value that is associated with gene
expression fold change. In Q2, this information is used to trim
the list of over-expressed genes by indicating that fold change
&gt; 0 but only in cases where the p-value is &lt; 0.001.</p>
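      <p>As a sketch of how Q0 could be phrased, the query below reuses the experiment-level graph pattern from the first block of Q4 and drops the gene and fold-change parts. We assume that Q0 follows the same pattern; the exact query on the demo site may differ. The namespace IRIs given for neurolex: and doid: are placeholders, since they are not stated in this paper, and must be replaced by the IRIs used in the data.</p>
      <preformat>PREFIX dct:    &lt;http://purl.org/dc/terms/&gt;
PREFIX biordf: &lt;http://purl.org/net/biordfmicroarray/ns#&gt;
PREFIX neurolex: &lt;http://example.org/neurolex/&gt;   # placeholder IRI
PREFIX doid:     &lt;http://example.org/doid/&gt;       # placeholder IRI

SELECT DISTINCT ?microarray_experiment WHERE {
  # Each result set belongs to an experiment and has a list of input samples
  ?experimentSet dct:isPartOf ?microarray_experiment ;
                 biordf:has_input_value ?sampleList .

  # Keep only samples from the entorhinal cortex of Alzheimer's patients
  ?sampleList biordf:derives_from_region neurolex:Entorhinal_cortex ;
              biordf:patients_have_disease doid:DOID_10652 .
}</preformat>
      <p>Writing the brain region and the disease directly into the triple patterns has the same effect as binding variables and testing them with FILTER, as Q4 does.</p>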
      <p>One of the most significant advantages of representing gene
lists in RDF is that scientists can enrich them with data from
linked datasets such that questions like Q4 and Q5 may be
answered. The dataset description provenance level enables
the discovery of useful datasets for specific purposes, for example
using the HCLS KB to discover diseases that may be
associated with specific genes. Q4, detailed below, achieves
that goal by first retrieving the same list of genes as in Q2 and,
secondly, by selecting the most recently updated SPARQL
service which includes assertions about both genes and
diseases. The final section queries this service to retrieve the
correlated diseases.</p>
      <preformat>SELECT DISTINCT ?diseaseName ?geneLabel ?geneName WHERE {
  # Retrieve a list of overexpressed genes in the entorhinal cortex of AD patients
  {
    ?experimentSet dct:isPartOf ?microarray_experiment ;
                   biordf:has_input_value ?sampleList ;
                   biordf:differentially_expressed_gene ?gene ;
                   biordf:has_ouput_value ?foldChange .
    ?sampleList biordf:derives_from_region ?brainRegion ;
                biordf:patients_have_disease ?alzheimers .
    ?gene rdfs:label ?geneLabel ;
          biordf:name ?geneName .
    ?foldChange rdf:value ?foldChangeValue ;
                stat:p_value ?pval .
    # Apply filters to constrain the number of results
    FILTER (xsd:float(?foldChangeValue) &gt; 0)
    FILTER (xsd:float(?pval) &lt; 0.001)
    FILTER (?brainRegion = neurolex:Entorhinal_cortex)
    FILTER (?alzheimers = doid:DOID_10652)
  }
  # Find the most recently updated SPARQL endpoint that contains information
  # about genes and diseases
  {
    ?source rdf:type void:Dataset ;
            void:sparqlEndpoint ?srvc ;
            dct:issued ?issued ;
            dct:subject diseasome:diseases ;
            dct:subject diseasome:genes .
    # Look for any dataset issued later than ?source
    OPTIONAL {
      ?source1 rdf:type void:Dataset ;
               void:sparqlEndpoint ?srvc2 ;
               dct:issued ?issued2 ;
               dct:subject diseasome:diseases ;
               dct:subject diseasome:genes .
      FILTER (?issued2 &gt; ?issued)
    }
    # Keep ?source only if no later dataset exists
    FILTER (!BOUND(?srvc2))
  }
  # Get associated diseases from the most recently updated Diseasome server.
  # ?srvc2 is unbound after the filter above, so the selected endpoint is ?srvc.
  SERVICE ?srvc {
    ?diseasomeGene rdfs:label ?geneLabel .
    ?disease diseasome:associatedGene ?diseasomeGene .
    ?disease rdfs:label ?diseaseName .
  }
}</preformat>
      <p>Finally, to answer Q6, data from the institutional
provenance level are used to limit the list of retrieved
experiments to those that were performed at a specific
institution. The queries presented here are executable through
our demo at http://purl.org/net/biordfmicroarray/demo.
Execution times range from 100 to 200 ms for local
queries (Q1-Q3, Q6) to a few seconds (2-5 s) for federated
queries (Q4-Q5) executed using SWObjects.</p>
    </sec>
    <sec id="sec-5">
      <title>C. Availability</title>
      <p>The RDF representation was generated using JavaScript
and the data was loaded into a public SPARQL endpoint
(http://purl.org/net/biordfmicroarray/sparql). We elaborate and
further expand the provenance queries in this paper at our
demo site http://purl.org/net/biordfmicroarray/demo. A figure
associating each of the four provenance levels with the data
that they are concerned with is also made available at the
demo site. The complete RDF/Turtle representation can be
downloaded from http://biordfmicroarray.googlecode.com/files/all3_genelists_provenance.ttl.
The JavaScript code to
convert Excel spreadsheets into RDF is available at
http://code.google.com/p/biordfmicroarray/ .</p>
      <p>IV. DISCUSSION</p>
      <p>A data model to explicitly make the content and context of
gene lists (e.g., differentially expressed genes) available in
RDF format was developed. In the process, four types of
provenance were identified that were found necessary to
characterize, discover, reproduce, compare and integrate gene
lists with other data. Expressing provenance in RDF enables
describing the data itself (i.e. its origin, version and URL
location) in the same language as the elements represented
therein. The power of this uniform access to data and metadata
should not be underestimated. In practice, this means that
SPARQL queries can express constraints both about the
origins of the data and contents (or attributes) of the data as
demonstrated by query Q4. In the case of Linked Open Data,
the set of best practices for exposing data as RDF through a
SPARQL endpoint, researchers often need to distinguish
between multiple RDF renderings (i.e. representations) of the
same data set or different versions of it. Different endpoints
can be discovered by issuing queries that target the data
sources themselves: When was the last RDF rendering created
and by whom (or which project)? Which
ontologies/vocabularies were used? The same standardized
SW mechanisms of reasoning and pattern matching that are
used to discover related facts across the data sources can also be
applied to select a specific data source.</p>
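      <p>A query of this kind, targeting the data sources themselves, might look like the following sketch. It lists the described datasets with their endpoints, newest first. The use of dct:publisher to record who created a rendering is our assumption for illustration; the dataset descriptions may record this with a different property.</p>
      <preformat>PREFIX rdf:  &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt;
PREFIX dct:  &lt;http://purl.org/dc/terms/&gt;
PREFIX void: &lt;http://rdfs.org/ns/void#&gt;

SELECT ?source ?srvc ?issued ?publisher WHERE {
  # Every described dataset that exposes a SPARQL endpoint
  ?source rdf:type void:Dataset ;
          void:sparqlEndpoint ?srvc ;
          dct:issued ?issued .
  # Who published the rendering, if stated (assumed property)
  OPTIONAL { ?source dct:publisher ?publisher }
}
ORDER BY DESC(?issued)</preformat>
      <p>Sorting by dct:issued is a simpler alternative to the OPTIONAL and !BOUND pattern of Q4 when the list is inspected by a person rather than passed on to a SERVICE clause.</p>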
      <p>The provenance data model developed for reporting
microarray experiment results while capturing different types
of provenance information was motivated by our user-defined
queries. We have therefore applied a bottom-up approach that
focused on describing the data first before mapping it to
widely used ontologies. Although several provenance
ontologies are available, some of them, such as Provenir, are upper-level
ontologies and therefore lack the specific
terms required for describing how gene lists were derived.
Other ontologies, such as the Provenance Vocabulary for
Linked Data and the Proof Markup Language, were created for
specific application domains, such as explaining reasoning
results. Our bottom-up approach enabled us to identify and
define the minimum set of provenance terms needed to answer a set of
queries from different perspectives, and it shielded the data model
from depending on external vocabularies, which are often
subject to change. For increased interoperability, mapping
terms from our model to terms from a community provenance
model, such as the OPM, is straightforward. For
example, our property biordf:has_input_value can be made a
sub-property of the inverse of the OPM property used, and
biordf:derives_from_region can become a sub-property of
the OPM property wasDerivedFrom.</p>
    </sec>
    <sec id="sec-6">
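      <p>The two property mappings mentioned in the preceding discussion can be sketched as a small rewriting step. The following Python fragment is only an illustration under stated assumptions: the opm: prefix, the MAPPING table and the function to_opm are hypothetical names, and in an RDF setting the same effect would normally be obtained by asserting sub-property axioms and letting a reasoner derive the OPM statements.</p>
      <preformat>
```python
# Hypothetical mapping table built from the two examples in the text.
# Each local property maps to (OPM property, swap subject and object?).
MAPPING = {
    # Sub-property of the INVERSE of used: subject and object are swapped.
    "biordf:has_input_value": ("opm:used", True),
    # Plain sub-property of wasDerivedFrom: direction is kept.
    "biordf:derives_from_region": ("opm:wasDerivedFrom", False),
}


def to_opm(triples):
    """Return the OPM-level triples entailed by MAPPING.

    Triples whose property has no mapping are skipped.
    """
    entailed = set()
    for subject, prop, obj in triples:
        if prop in MAPPING:
            opm_prop, inverse = MAPPING[prop]
            if inverse:
                entailed.add((obj, opm_prop, subject))
            else:
                entailed.add((subject, opm_prop, obj))
    return entailed
```
      </preformat>
      <p>With this table, a statement such as (ex:a, biordf:has_input_value, ex:b) yields (ex:b, opm:used, ex:a), while a biordf:derives_from_region statement keeps its direction. The ex: node names are placeholders and say nothing about the intended domain or range of either property.</p>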
      <p>
        Further down the pipeline of microarray studies,
bioinformaticians will often need to combine knowledge about
the genes derived from their microarray experiments in order
to achieve a deeper understanding at a systems biology level.
Although many gene expression studies have significantly reduced the
number of genes that must be considered when studying Alzheimer’s disease,
a good number of genes (ranging from tens to hundreds) remain to be
processed. One approach becoming increasingly popular is the
use of scientific workflow workbenches (such as Taverna and
Kepler) to perform large scale data analysis. Many such
workbenches [
        <xref ref-type="bibr" rid="ref19 ref20">19-20</xref>
        ] also record workflow provenance
information about, for example, which genes from which
organism were processed and how the proteins encoded by the
genes were discovered by querying various genomic
databases. Combining this workflow provenance information
with the microarray experiment-related provenance
information, by mapping both to a common community
provenance model such as the OPM, would increase the trustworthiness and
reproducibility of experiment results
throughout the whole experiment life cycle. McCusker and McGuinness
[
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] have taken a first step in this direction by providing a tentative
translation from MAGE-TAB to the OPM.
      </p>
      <p>While we endorse the use of SW technologies as the
standard machine-readable format, we acknowledge that most
biologists are not familiar with SW and prefer to use formats
such as Excel spreadsheets to work with gene list results. To
this end, it would be useful to use a standardized user-friendly
format (e.g., MAGE-TAB) for encoding gene lists and their
context that could be easily converted into the SW format.</p>
      <sec>
        <title>CONCLUSION</title>
        <p>We describe and illustrate with a case study the beneficial
role of Semantic Web technologies in ‘omic’ data
representation by providing and querying a data model that
captures provenance information related to reporting
microarray experiment results. We have tackled not only the
engineering aspect of the data integration problem, but also the
more fundamental issues of federating data, which begin with
seemingly homogeneous data sources (microarray databases)
and extend to heterogeneous data domains at multiple levels.
This work is also driven by the growing collaboration between a
wide spectrum of scientific disciplines and communities, such
as is required for translational research. We have used a
bottom-up approach that facilitated the identification of four
provenance levels necessary to report microarray experiment
results and shielded our data model from becoming dependent
on constantly evolving ontologies. We have, however,
discussed how some of the terms and relationships from
existing provenance ontologies can be mapped to our model.
Some issues that had to be addressed in the integration of
microarray data sources could also be considered relevant for
the federation of data sources in general. As more ‘omics’ data
are generated, the complexity and requirements of
discovery-based research increase. As a result, there is a growing
demand for effective data provenance and integration at many
levels that counts on the active involvement of scientists and
informaticians. Our work represents a step in this direction.</p>
      </sec>
      <sec id="sec-6-1">
        <title>ACKNOWLEDGMENT</title>
        <p>The authors would like to express their sincere gratitude to
the HCLS IG for helping to organize and coordinate with
different task forces including the BioRDF task force within
which the neuroscience microarray use case was explored. We also
thank Helen Parkinson, James Malone, Misha Kapushesky,
Jonas Almeida and three anonymous reviewers. MSM
appreciated the support of Jelle Goeman (LUMC) during this
work.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Stears</surname>
            <given-names>RL</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martinsky</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schena</surname>
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Trends in microarray analysis</article-title>
          .
          <source>Nature Medicine</source>
          .
          <volume>9</volume>
          :
          <fpage>140</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Barrett</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Troup</surname>
            <given-names>DB</given-names>
          </string-name>
          , et al.
          (
          <year>2009</year>
          ).
          <article-title>NCBI GEO: archive for high-throughput functional genomic data</article-title>
          .
          <source>Nucleic Acids Res</source>
          .
          <year>2009</year>
          Jan;
          <volume>37</volume>
          (Database issue):
          <fpage>D885</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Lukk</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kapushesky</surname>
            <given-names>M</given-names>
          </string-name>
          , et al.
          <article-title>A global map of human gene expression</article-title>
          .
          <source>Nat Biotechnol</source>
          <volume>28</volume>
          ,
          <fpage>322</fpage>
          -
          <lpage>324</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Brazma</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hingamp</surname>
            <given-names>P</given-names>
          </string-name>
          , et al.
          (
          <year>2001</year>
          ).
          <article-title>Minimum information about a microarray experiment (MIAME)-toward standards for microarray data</article-title>
          .
          <source>Nat Genet</source>
          .
          <volume>29</volume>
          (
          <issue>4</issue>
          ):
          <fpage>365</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Spellman</surname>
            <given-names>PT</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            <given-names>M</given-names>
          </string-name>
          , et al.
          (
          <year>2002</year>
          ).
          <article-title>Design and implementation of microarray gene expression markup language (MAGE-ML)</article-title>
          .
          <source>Genome Biol</source>
          .
          <volume>3</volume>
          (
          <issue>9</issue>
          ):
          <fpage>RESEARCH0046</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Rayner</surname>
            <given-names>TF</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rocca-Serra</surname>
            <given-names>P</given-names>
          </string-name>
          , et al.
          (
          <year>2006</year>
          ).
          <article-title>A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB</article-title>
          .
          <source>BMC Bioinformatics</source>
          .
          <volume>7</volume>
          :
          <fpage>489</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Gollub</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ball</surname>
            <given-names>CA</given-names>
          </string-name>
          , et al.
          (
          <year>2003</year>
          ).
          <article-title>The Stanford Microarray Database: data access and quality assessment tools</article-title>
          .
          <source>Nucleic Acids Res</source>
          .
          <volume>31</volume>
          (
          <issue>1</issue>
          ):
          <fpage>94</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Cheung</surname>
            <given-names>KH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>White</surname>
            <given-names>K</given-names>
          </string-name>
          , et al.
          (
          <year>2002</year>
          ).
          <article-title>YMD: a microarray database for large-scale gene expression analysis</article-title>
          .
          <source>Proc AMIA Symp</source>
          .
          <year>2002</year>
          :
          <fpage>140</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Manduchi</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grant</surname>
            <given-names>GR</given-names>
          </string-name>
          , et al.
          (
          <year>2004</year>
          ).
          <article-title>RAD and the RAD StudyAnnotator: an approach to collection, organization and exchange of all relevant information for high-throughput gene expression studies</article-title>
          .
          <source>Bioinformatics</source>
          .
          <volume>20</volume>
          (
          <issue>4</issue>
          ):
          <fpage>452</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Parkinson</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarkans</surname>
            <given-names>U</given-names>
          </string-name>
          , et al.
          (
          <year>2005</year>
          ).
          <article-title>ArrayExpress--a public repository for microarray gene expression data at the EBI</article-title>
          .
          <source>Nucleic Acids Res</source>
          .
          <volume>33</volume>
          (Database issue):
          <fpage>D553</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Berners-Lee</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendler</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lassila</surname>
            <given-names>O.</given-names>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>The Semantic Web</article-title>
          .
          <source>Scientific American</source>
          .
          <volume>284</volume>
          (
          <issue>5</issue>
          ):
          <fpage>34</fpage>
          -
          <lpage>43</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Wang</surname>
            <given-names>X</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gorlitsky</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Almeida</surname>
            <given-names>JS</given-names>
          </string-name>
          .
          (
          <year>2005</year>
          ).
          <article-title>From XML to RDF: how semantic web technologies will change the design of 'omic' standards</article-title>
          .
          <source>Nat Biotechnol</source>
          .
          <volume>23</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1099</fpage>
          -
          <lpage>103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Dunckley</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beach</surname>
            <given-names>TG</given-names>
          </string-name>
          , et al.
          (
          <year>2006</year>
          ).
          <article-title>Gene expression correlates of neurofibrillary tangles in Alzheimer's disease</article-title>
          .
          <source>Neurobiol Aging</source>
          ;
          <volume>27</volume>
          :
          <fpage>1359</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Liang</surname>
            <given-names>WS</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dunckley</surname>
            <given-names>T</given-names>
          </string-name>
          , et al.
          (
          <year>2007</year>
          ).
          <article-title>Gene expression profiles in anatomically and functionally distinct regions of the normal aged human brain</article-title>
          .
          <source>Physiol Genomics</source>
          <volume>28</volume>
          :
          <fpage>311</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Liang</surname>
            <given-names>WS</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reiman</surname>
            <given-names>EM</given-names>
          </string-name>
          , et al.
          (
          <year>2008</year>
          ).
          <article-title>Alzheimer's disease is associated with reduced expression of energy metabolism genes in posterior cingulate neurons</article-title>
          .
          <source>Proc Natl Acad Sci U S A</source>
          <year>2008</year>
          ;
          <volume>105</volume>
          :
          <fpage>4441</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Alexander</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hausenblas</surname>
            <given-names>M</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Zhao</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Describing linked datasets</article-title>
          .
          <source>In Linked Data on the Web Workshop in the International World Wide Web Conference</source>
          , Madrid, Spain,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Hartig</surname>
            <given-names>O</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Publishing and consuming provenance metadata on the web of linked data</article-title>
          .
          <source>In Proceedings of the Third International Provenance and Annotation Workshop</source>
          , Troy, NY, U.S.A.,
          <year>2010</year>
          . In press
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Kotecha</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bruck</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            <given-names>N.</given-names>
          </string-name>
          <article-title>Pathway knowledge base: An integrated pathway resource using BioPAX</article-title>
          .
          <source>Applied Ontology</source>
          .
          <volume>3</volume>
          (
          <issue>4</issue>
          );
          <fpage>235</fpage>
          -
          <lpage>245</lpage>
          .
          <year>2008</year>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Missier</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sahoo</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goble</surname>
            <given-names>C</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sheth</surname>
            <given-names>A</given-names>
          </string-name>
          .
          <article-title>Janus: Semantic Provenance Infrastructure for Taverna</article-title>
          .
          <source>In Proceedings of the Third International Provenance and Annotation Workshop</source>
          , Troy, NY, U.S.A.,
          <year>2010</year>
          . In press
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Altintas</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anand</surname>
            <given-names>M</given-names>
          </string-name>
          , et al.
          <article-title>Understanding Collaborative Studies Through Interoperable Workflow Provenance</article-title>
          .
          <source>In Proceedings of the Third International Provenance and Annotation Workshop</source>
          , Troy, NY, U.S.A.,
          <year>2010</year>
          . In press
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>McCusker</surname>
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>McGuinness</surname>
            <given-names>D.</given-names>
          </string-name>
          <article-title>Explorations into the Provenance of High Throughput Biomedical Experiments</article-title>
          .
          <source>In Proceedings of the Third International Provenance and Annotation Workshop</source>
          , Troy, NY, U.S.A.,
          <year>2010</year>
          . In press
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>