=Paper= {{Paper |id=None |storemode=property |title=Provenance of Microarray Experiments for a Better Understanding of Experiment Results |pdfUrl=https://ceur-ws.org/Vol-670/paper_6.pdf |volume=Vol-670 |dblpUrl=https://dblp.org/rec/conf/semweb/DeusZSSPMMC10 }} ==Provenance of Microarray Experiments for a Better Understanding of Experiment Results== https://ceur-ws.org/Vol-670/paper_6.pdf

Provenance of Microarray Experiments for a
Better Understanding of Experiment Results
Helena F. Deus
Department of Bioinformatics and Computational Biology, The University of Texas M. D. Anderson Cancer Center, Houston, USA
Instituto de Tecnologia Química e Biológica, UNL, Lisboa, Portugal

Jun Zhao
Department of Zoology, University of Oxford, Oxford, UK

Satya Sahoo
Kno.e.sis Center, Department of Computer Science and Engineering, Wright State University, Dayton, USA

Mathias Samwald
Digital Enterprise Research Institute, National University of Ireland Galway, Galway, Ireland

Eric Prud’hommeaux
World Wide Web Consortium, MIT, Cambridge, USA

Michael Miller
Tantric Designs, Seattle, USA

* M. Scott Marshall
Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands
Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands

* Kei-Hoi Cheung
Center for Medical Informatics, Yale University School of Medicine, New Haven, USA

*These authors contributed equally to this work. KC is supported in part by NIH grant U24 NS051869. JZ is supported by EPSRC grant EP/G049327/1. HFD is supported by the Portuguese FCT (Fundação para a Ciência e Tecnologia) scholarship SFRH/BD/45963/2008. The work of MS was funded by the Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by a postdoctoral fellowship from the Konrad Lorenz Institute for Evolution and Cognition Research, Austria.

Abstract: This paper describes a Semantic Web (SW) model for gene lists and the metadata required for their practical interpretation. Our provenance information captures the context of experiments as well as the processing and analysis parameters involved in deriving the gene lists from DNA microarray experiments. We demonstrate a range of practical neuroscience queries which draw on the proposed model. Our provenance representation includes the origins of the gene list and basic information about the data set itself (e.g., last modification date and original data source), in order to facilitate the federation of gene lists with other types of Semantic Web-formatted data and the integration of a broader molecular context through additional omics data.

Keywords: data integration, query federation, semantic web

I. INTRODUCTION

In the genomics/post-genomics era, massive amounts of data generated by high throughput experiments, including those using microarray technologies, have presented both promises and challenges to clinical and translational research. One goal of microarray experiments is to discover, out of tens of thousands of genes, a small subset of genes (usually on the order of hundreds) whose expression pattern is indicative of some biological response to a given experimental condition. Many computational/statistical approaches have been developed to detect such biologically significant gene lists.

According to [1], the workflow of a microarray experiment is divided into the following steps: i) experimental design, which includes the type of biological questions the experiment is designed to address, how the experiment is implemented (e.g., experiment and control), sample preparation, microarray platform selection, hybridization process, and scanning; ii) data extraction, which includes image quantification, filtering, and normalization; and iii) data analysis and modeling, which include approaches such as clustering, t-tests, enrichment analysis and so on.

The gene lists produced in step iii are usually reported as part of the experimental results published in scientific papers, and the steps involved in obtaining the gene lists are described in the methods section. Sometimes, gene lists are made electronically available (e.g., as spreadsheets) through journal web sites. However, to the best of our knowledge, there is no standard format for uniformly representing and broadly sharing such gene lists in a focused scientific context.

We believe it would be useful to the community if such gene lists were commonly represented in a standard SW vocabulary and accessible to SW applications. This approach makes it possible for researchers to work with the gene list without requiring a post hoc significance analysis to re-derive the list. If experimental factors are included with gene lists, researchers can account for context without requiring labor-intensive manual research into the experimental factors for
each microarray study. A standard representation can be used both for gene lists reported in individual papers (note that these published gene lists are not yet stored in most microarray databases) and for those computed from datasets collected from multiple microarray experiments across different microarray databases (e.g., GEO profiles [2] and Gene Expression Atlas [3]).

Integrated analysis (meta-analysis) requires raw and processed datasets from independent microarray experiments to be selected, compared, combined, and correlated using a variety of computational/statistical methods. This is, of course, much easier with machine-readable provenance and experimental context. To this end, MIAME [4] was proposed by the Microarray Gene Expression Data (MGED, http://www.mged.org) community (now called the “Functional Genomics Data Society” or FGED) to describe the Minimum Information About a Microarray Experiment (MIAME) that is needed to interpret the results of the experiment unambiguously and potentially to reproduce the experiment. MIAME represents a set of guidelines for microarray databases and data management software. The MAGE data model and MAGE-ML (a standard XML format for serializing the MAGE model) [5] have been developed based on the MIAME data content specifications. In addition, MAGE-TAB [6] was proposed as a (more user-friendly) alternative to MAGE-ML.

Along with the development of these standards, a significant number of microarray databases have been created, ranging from those of individual labs (e.g., Nomad at the DeRisi lab, http://ucsf-nomad.sourceforge.net/) and institutions (e.g., SMD [7], YMD [8], and RAD [9]) to those serving the scientific community (e.g., GEO [2] and ArrayExpress [10]), making large collections of microarray datasets accessible to the public. There are also microarray databases that serve the needs of specific biomedical domains (e.g., the NIH Neuroscience Microarray Consortium, http://np2.ctrl.ucla.edu/np2/home.do). Major journal publishers have promoted sharing of microarray data by requiring authors to submit their data to public microarray repositories. Some journal publishers make supplemental data available on their web sites.

While many microarray databases are MIAME-compliant, several challenges still remain for researchers wishing to locate datasets relevant to their interest:

• There is no central repository for all microarray datasets, and experiments and datasets are stored across multiple databases.
• Users must learn to use different search interfaces and analytic facilities at each database.
• Many databases lack experimental context, annotation, and provenance.
• There is a lack of use of standard vocabularies in many microarray databases.
• The lists of differentially expressed genes discussed by most articles associated with a microarray study are not disclosed in any standard format, nor are they programmatically accessible.

The Semantic Web [11] has been actively explored in the context of biomedicine. For example, the W3C Semantic Web Health Care and Life Sciences Interest Group (HCLS IG) (http://www.w3.org/2001/sw/hcls/) represents a major community effort involving both academia and industry. The HCLS IG and allied efforts provide a growing corpus of biomedical datasets expressed in the Resource Description Framework (RDF) and the Web Ontology Language (OWL). Wang et al. [12] have described how the transition from the Extensible Markup Language (XML) to RDF could potentially enhance semantic representation and integration of omic data. In addition to data, biomedical ontologies are made available to the community through organizations such as NCBO (http://www.bioontology.org/) and OBO Foundry (http://www.obofoundry.org/).

In this paper we explore using SW to represent microarray experimental data and provenance information about the context under which the data were generated, including the goal of the experiment, experimental factors (such as the disease or the cell region), and the statistical analysis process which leads to the experiment results. We explore the role of provenance information in helping biologists understand microarray experiments in the context of other experiments as well as other existing biomedical knowledge. To facilitate a quality-aware federation of microarray experiment results, we also provide provenance information about the gene list data published using SW standards. As a pilot study, we take a bottom-up approach, focusing on the type of provenance information required to meet our motivating use cases and creating a representation model with the minimum set of terms to meet these use cases. Although these terms are currently defined in our own namespaces, they can largely be mapped, in the next stage of our pilot study, to existing provenance vocabularies, which are generically defined and still evolving, to achieve maximum interoperability.

II. MOTIVATION

One motivation of microarray experiments is to identify genes that are differentially expressed in biological samples under different conditions (e.g., disease vs. control). The samples may come from tissues extracted from different organs or parts of the same organ (e.g., different brain regions). In this case, we may be able to discover differentially expressed genes in each organ/organ part and how disease may affect each organ/organ part at the gene expression level. A common outcome of experiments is a list of candidate genes which may serve as diagnostic or therapeutic markers. These gene lists, abundant in biomedical literature, are provided in heterogeneous formats (e.g., Excel spreadsheets and printed tables embedded in papers) that hinder the reuse of the results. In order to reuse such gene lists in additional pathway or molecular analysis, it is important that they are represented in a standardized, distributable, and machine-readable format that is amenable to semantic queries.

After obtaining a representative list of differentially expressed genes, scientists may need to study these experiment results in a broader molecular context with
additional data. In the case of neurological disease studies such as Alzheimer's Disease (AD), researchers may want to combine gene expression data from multiple AD microarray studies. For example, one characteristic of AD is the formation of intracellular neurofibrillary tangles that affect neurons in brain regions involved in memory function. It is important to have metadata such as the cell type(s), cell histopathology, and brain region(s) for comparing/integrating the results across different AD microarray experiments. It is also important to consider the (raw) data source and the types of analysis performed on the data to arrive at meaningful interpretations. Finally, gene expression data may be combined with other types of data, including genomic functions, pathways, and associated diseases, to broaden the spectrum of integrative data analysis.

In our pilot study, we selected three microarray experiments from different journals ([13-15]) to explore how to represent gene list experiment results in a structured format and what types of metadata can better enable the computer to search for genes that may play a molecular role in the pathogenesis of AD. All the gene lists from the selected publications were derived from human brain samples that were prepared for AD studies. We wanted to be able to answer a variety of user questions regarding semantically related experiments and their experimental results. For example:

• Q0: What microarray experiments analyze samples taken from the Entorhinal cortex region of Alzheimer's patients?
• Q1: Was the same data normalization algorithm or statistical software package used in both studies that analyze gene expression in the Entorhinal cortex region of AD patients?
• Q2: What genes are overexpressed in the Entorhinal cortex region in the context of Alzheimer's and what is their expression fold change and associated p-value?
• Q3: Are there any genes that are expressed differently in two different brain regions (such as in Hippocampus and Entorhinal cortex)?

The MIAME standard outlines the minimum set of information that is needed for describing microarray experiments in order to facilitate the reproduction of these experiments and a uniform interpretation of experiment results. Experiments that record and publish a MIAME-compliant experimental protocol should contain sufficient information to answer questions like Q0 and Q1. However, because MIAME does not specify a format, and MAGE-ML and MAGE-TAB do not specify a standard representation for experiment results (such as the set of genes showing particular expression patterns), there is no simple mechanism to find semantically related experimental results based on the patterns of differentially expressed genes.

In order to answer questions Q2 and Q3, it is necessary to model both experimental information (e.g., Entorhinal cortex) and statistical data (e.g., the p-values associated with gene expression values).

Additionally, we want to be able to extend the knowledge about genes linked to AD such that scientists can access and extend their understanding of their gene expression data analysis results to answer questions like the following:

• Q4: What other diseases may be associated with the same genes found to be linked to AD?
• Q5: What drugs are known that affect the same overexpressed gene products and what are their target diseases?
• Q6: Select all the genes determined to be differentially expressed in the Entorhinal cortex in experiments performed by AD investigators at the Translational Genomics Research Institute.

For these types of questions, the microarray experiment results need to be federated (Q4, Q5) or combined (Q6) with other datasets describing the data itself. We show how the structured representation of microarray experiment data and associated provenance metadata will enable us to query across different aspects of domain knowledge about these experiment results using several other datasets in the HCLS knowledge base (HCLS KB). We also show how we can provide additional provenance information about different datasets to support some quality-aware federation queries over distributed data sources.

III. METHODS

To address questions Q0-Q3 we need both a precise representation of the gene lists reported in the three selected publications and a representation of the provenance of these gene lists, such as the methods and procedures involved in their generation. As mentioned in Section I, several standards exist for describing microarray experiment protocols; however, none is comprehensive enough to fully capture the complex process of reporting the results of a microarray experiment. To answer questions Q4-Q5 we need to query across the exemplar datasets, using provenance information of different levels of granularity, from the basic information about the context of each experiment to details about the analysis processes generating the gene expression results.

Although a number of provenance vocabularies, such as the Open Provenance Model (OPM, http://openprovenance.org/) and Provenir (http://wiki.knoesis.org/index.php/Provenir_Ontology), are available, we chose a bottom-up approach in this pilot study. On the one hand, at the time of writing, little was known about how to choose between these existing vocabularies to best suit our purpose; on the other hand, our pilot study aims to focus on capturing the minimum information to answer our case study questions. This approach has the added advantage of shielding our model from having to keep pace with rapidly evolving ontologies while still enabling mapping to upper level ontologies in the future. For these reasons, our data model includes the minimum set of terms necessary to describe the three examples selected, and is made available under our own local namespace:

@prefix biordf: <http://purl.org/net/biordfmicroarray/ns#> .

Compared with provenance vocabularies, many domain specific ontologies are much more established and stable, such as NIF (http://www.neuinfo.org/), disease ontology (DO, http://do-wiki.nubic.northwestern.edu/index.php/Main_Page), or the voiD vocabulary [16]. Therefore, we reuse terms from
these ontologies that are already widely used to annotate (biological) datasets in our data model in order to enable maximum interoperability with other approaches.

A. The Data Model

Our data model captures the minimum information necessary to describe the gene lists and the microarray experiment context in which they were generated. To answer each of the individual case study questions, different aspects of each dataset had to be considered. For example, to answer questions like Q0 and Q3 a good overview of each microarray experiment is necessary, including the samples used, the disease of interest, microarray platform, etc. For questions like Q1 and Q2, however, a different set of assertions concerned specifically with comparing gene expression quantification methods in different settings is required. Finally, the ability to answer questions like Q4 and Q5 involves the more complex component of performing simultaneous queries on more than one data source. As such, information describing the metadata associated with each data source is also necessary. To accommodate these different data types in our model, we have defined four provenance levels, with each level entailing different subsets of information:

• Institutional level: Includes assertions about the laboratory where the experiments were performed and the reference where the results were published to help determine the trustworthiness of the data. This information is useful to constrain the list of significant genes to only those that are published in peer-reviewed articles and/or were produced at certain institutions that have a track record of generating high quality microarray data published in respected journals.
• Experiment protocol level: Includes assertions about the brain regions from which the samples were gathered and the histology of the cells. Such information has been partially mapped to MGED, DO and NIF terms.
• Data analysis and significance level: Includes assertions about the statistical analysis methodology for selecting the relevant genes. Terms defined for this level are also provided as a separate statistics module (http://purl.org/net/biordfmicroarray/stat#) to describe software tools and statistical terms.
• Dataset description level: Includes assertions about when the dataset was published, based on which version of a source dataset, and who published the dataset. Some existing vocabularies for describing RDF datasets on the Web were reused to enhance their trustworthiness, such as the Vocabulary of Interlinked Datasets (voiD) [16], which provides basic information about who published the data as well as a summary of the content of the dataset, such as the number of genes described by the dataset or the SPARQL endpoint through which the dataset can be accessed. The Provenance Vocabulary [17] was also used to provide a richer set of provenance information, such as when the dataset was published, using which tool, or by accessing which data server.

B. Formulation of SPARQL queries

The queries described here are formulated at our demo site (http://purl.org/net/biordfmicroarray/demo), where they can be directly executed or copied and performed locally using software such as SWObjects (https://sourceforge.net/projects/swobjects/files/). The demo site also includes a diagram explaining the four provenance levels and the types of data entailed in each level.

To answer Q0, experiments performed on samples collected from patients with Alzheimer’s disease in a specific area of the brain, the Entorhinal cortex, must be selected from the RDF representation. The data necessary to answer this question are entirely contained in the experimental provenance level, and the query can be formulated in terms of the entities used to represent each step of the workflow involved in collecting a Sample. Making use of data from the statistical analysis provenance level, the same query Q0 can be amended to filter the list of experiments retrieved based on the statistical normalization software, thus enabling an answer to Q1. To answer questions Q2 and Q3, data pertaining to the experiment provenance level must also be combined with information about the gene lists, such as the expression level for each gene. A common measure of the statistical significance of differentially expressed genes is the p-value that is associated with the gene expression fold change. In Q2, this information is used to trim the list of overexpressed genes by requiring a fold change > 0 and a p-value < 0.001.

One of the most significant advantages of representing gene lists in RDF is that it helps scientists enrich them with data from linked datasets such that questions like Q4 and Q5 may be answered. The dataset description provenance level enables the discovery of useful datasets for specific purposes, for example using the HCLS KB to discover diseases that may be associated with specific genes. Q4, detailed below, achieves that goal by first retrieving the same list of genes as in Q2 and, secondly, by selecting the most recently updated SPARQL service which includes assertions about both genes and diseases. The final section queries this service to retrieve the correlated diseases.

SELECT DISTINCT ?diseaseName ?geneLabel ?geneName WHERE {
  # Retrieve a list of overexpressed genes in the entorhinal cortex of AD patients
  {
    ?experimentSet dct:isPartOf ?microarray_experiment ;
        biordf:has_input_value ?sampleList ;
        biordf:differentially_expressed_gene ?gene ;
        biordf:has_ouput_value ?foldChange .
    ?sampleList biordf:derives_from_region ?brainRegion ;
        biordf:patients_have_disease ?alzheimers .
    ?gene rdfs:label ?geneLabel ;
        biordf:name ?geneName .
    ?foldChange rdf:value ?foldChangeValue ;
        stat:p_value ?pval .
    # Apply filters to constrain the amount of results
    FILTER (xsd:float(?foldChangeValue) > 0)
    FILTER (xsd:float(?pval) < 0.001)
    FILTER (?brainRegion = neurolex:Entorhinal_cortex)
    FILTER (?alzheimers = doid:DOID_10652)
  }
  # Find most recently updated SPARQL endpoint that contains information about genes and diseases.
  {
    ?source rdf:type void:Dataset ;
        void:sparqlEndpoint ?srvc ;
        dct:issued ?issued ;
        dct:subject diseasome:diseases ;
        dct:subject diseasome:genes .
    OPTIONAL {
      ?source1 rdf:type void:Dataset ;
          void:sparqlEndpoint ?srvc2 ;
          dct:issued ?issued2 ;
          dct:subject diseasome:diseases ;
          dct:subject diseasome:genes .
      FILTER (?issued2 > ?issued)
    }
    # Keep only the dataset for which no more recently issued dataset exists
    FILTER (!BOUND(?srvc2))
  }
  # Get associated diseases from most recently updated Diseasome server.
  SERVICE ?srvc {
    ?diseasomeGene rdfs:label ?geneLabel .
    ?disease diseasome:associatedGene ?diseasomeGene .
    ?disease rdfs:label ?diseaseName .
  }
}

Finally, to answer Q6 we use data from the institutional provenance level to limit the list of retrieved experiments to those that were performed at a specific institution. The queries presented here are executable through our demo at http://purl.org/net/biordfmicroarray/demo. Execution times range between 100 and 200 ms for local queries (Q1-Q3, Q6) and a few seconds (2-5 s) for federated queries (Q4-Q5) executed using SWObjects.

C. Availability

The RDF representation was generated using JavaScript and the data was loaded into a public SPARQL endpoint (http://purl.org/net/biordfmicroarray/sparql). We elaborate and further expand the provenance queries in this paper at our demo site http://purl.org/net/biordfmicroarray/demo. A figure associating each of the four provenance levels with the data that they are concerned with is also made available at the demo site. The complete RDF/Turtle representation can be downloaded from http://biordfmicroarray.googlecode.com/files/all3_genelists_provenance.ttl. The JavaScript code to convert Excel spreadsheets into RDF is available at http://code.google.com/p/biordfmicroarray/.

IV. DISCUSSION

A data model to explicitly make the content and context of gene lists (e.g., differentially expressed genes) available in RDF format was developed. In the process, four types of provenance were identified that were found necessary to characterize, discover, reproduce, compare and integrate gene lists with other data. Expressing provenance in RDF enables describing the data itself (i.e., its origin, version and URL location) in the same language as the elements represented therein. The power of this uniform access to data and metadata should not be underestimated. In practice, this means that SPARQL queries can express constraints both about the origins of the data and about the contents (or attributes) of the data, as demonstrated by query Q4. In the case of Linked Open Data, the set of best practices for exposing data as RDF through a SPARQL endpoint, researchers often need to distinguish between multiple RDF renderings (i.e., representations) of the same data set or different versions of it. Different endpoints can be discovered by issuing queries that target the data sources themselves: When was the last RDF rendering created and by whom (or which project)? Which ontologies/vocabularies were used? The same standardized SW mechanisms of reasoning and pattern matching that are used to discover related facts across the data sources can also be applied to select a specific data source.

The provenance data model developed for reporting microarray experiment results while capturing different types of provenance information was motivated by our user-defined queries. We have therefore applied a bottom-up approach that focused on describing the data first before mapping it to widely used ontologies. Although several provenance ontologies are available, some of them, such as Provenir, are upper level ontologies and therefore lack the specific terms required for describing how gene lists were derived. Other ontologies, such as the Provenance Vocabulary for Linked Data and the Proof Markup Language, were created for specific application domains, such as explaining reasoning results. Our bottom-up approach enabled us to identify and define the minimum set of provenance terms to answer a set of queries from different perspectives and to shield the data model from depending on external vocabularies, which are often subject to change. For increased interoperability, mapping terms from our model to terms from a community provenance model, such as the OPM, is straightforward. For example, our property biordf:has_input_value can be made a sub-property of the inverse of the OPM property used, and biordf:derives_from_region can become a sub-property of the OPM property wasDerivedFrom.

Further down the pipeline of microarray studies, bioinformaticians will often need to combine knowledge about the genes derived from their microarray experiments in order to achieve a deeper understanding at a systems biology level. Although the number of genes that has to be taken into consideration while studying Alzheimer’s has been significantly reduced by many gene expression studies, a good number of genes (ranging from tens to hundreds) are yet to be processed. One approach becoming increasingly popular is the use of scientific workflow workbenches (such as Taverna and Kepler) to perform large scale data analysis. Many such workbenches [19-20] also record workflow provenance information about, for example, what genes from which organism were processed and how the proteins encoded by the genes were discovered by querying various genomic databases. By combining this workflow provenance information with the set of microarray experiment-related provenance information, through mapping both to a common community provenance model such as OPM, the trustworthiness and reproducibility of experiment results would be increased throughout the whole experiment life cycle. McCusker et al. [21] have taken a first step in this direction by providing a tentative translation from MAGE-TAB to the OPM.
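The sub-property mapping to OPM mentioned in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the prefixed OPM names (`opm:used`, `opm:wasDerivedFrom`), the `ex:` sample identifiers and the helper `to_opm` are hypothetical and chosen only for illustration. The direction of the `used` mapping follows the wording of the text (sub-property of the inverse).

```python
# Illustrative sketch only. The mapping table mirrors the two examples given
# in the text; the "opm:" prefixed names and all "ex:" identifiers are
# assumptions made for this example, not terms taken from the published data.

# biordf property -> (OPM property, whether subject and object are swapped)
SUB_PROPERTY_OF = {
    # sub-property of OPM wasDerivedFrom (same direction)
    "biordf:derives_from_region": ("opm:wasDerivedFrom", False),
    # sub-property of the inverse of OPM used (direction swapped, as stated)
    "biordf:has_input_value": ("opm:used", True),
}


def to_opm(triples):
    """Return the OPM-level triples entailed by the sub-property mapping."""
    inferred = set()
    for subject, prop, obj in triples:
        if prop in SUB_PROPERTY_OF:
            opm_prop, inverse = SUB_PROPERTY_OF[prop]
            if inverse:
                inferred.add((obj, opm_prop, subject))
            else:
                inferred.add((subject, opm_prop, obj))
    return inferred


# Hypothetical sample data shaped like the patterns used in query Q4.
sample = {
    ("ex:experimentSet1", "biordf:has_input_value", "ex:sampleList1"),
    ("ex:sampleList1", "biordf:derives_from_region", "neurolex:Entorhinal_cortex"),
    ("ex:gene1", "rdfs:label", "GENE1"),
}

opm_view = to_opm(sample)
for triple in sorted(opm_view):
    print(triple)
```

In an RDF store the same effect would normally come from declaring rdfs:subPropertyOf (and an inverse property) and letting a reasoner derive the OPM triples; the explicit table above only makes the entailment visible. Triples whose property has no mapping, such as the rdfs:label statement, are left out of the OPM view.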
While we endorse the use of SW technologies as the standard machine-readable format, we acknowledge that most biologists are not familiar with SW and prefer to use formats such as Excel spreadsheets to work with gene list results. To this end, it would be useful to use a standardized user-friendly format (e.g., MAGE-TAB) for encoding gene lists and their context that could be easily converted into the SW format.

V. CONCLUSION

We describe and illustrate with a case study the beneficial role of Semantic Web technologies in ‘omic’ data representation by providing and querying a data model to capture provenance information related to reporting microarray experiment results. We have tackled not only the engineering aspect of the data integration problem, but also the more fundamental issues of data federation, which begin with seemingly homogeneous data sources (microarray databases) and extend to heterogeneous data domains at multiple levels. This is also driven by the growing collaboration between a wide spectrum of scientific disciplines and communities such as is required for translational research. We have used a bottom-up approach that facilitated the identification of four provenance levels necessary to report microarray experiment results and shielded our data model from becoming dependent on constantly evolving ontologies. We have, however, discussed how some of the terms and relationships from existing provenance ontologies can be mapped to our model. Some issues found to be necessary in the integration of microarray data sources could also be considered relevant for the federation of data sources in general. As more ‘omics’ data are generated, the complexity and requirements for discovery-based research increase. As a result, there is a growing demand for effective data provenance and integration at many levels that counts on the active involvement of scientists and informaticians. Our work represents a step in this direction.

ACKNOWLEDGMENT

The authors would like to express their sincere gratitude to the HCLS IG for helping to organize and coordinate with different task forces including the BioRDF task force within which the neuroscience microarray use case was explored. Also thanks to Helen Parkinson, James Malone, Misha Kapushesky, Jonas Almeida and three anonymous reviewers. MSM appreciated the support of Jelle Goeman (LUMC) during this work.

REFERENCES

[1] Stears RL, Martinsky T, Schena M. (2003). Trends in microarray analysis. Nature Medicine. (9):140-145.
[2] Barrett T, Troup DB, et al. (2009). NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 37(Database issue):D885-90.
[3] Lukk M, Kapushesky M, et al. (2010). A global map of human gene expression. Nat Biotechnol. 28:322-324.
[4] Brazma A, Hingamp P, et al. (2001). Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet. 29(4):365-71.
[5] Spellman PT, Miller M, et al. (2002). Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol. 3(9):RESEARCH0046.
[6] Rayner TF, Rocca-Serra P, et al. (2006). A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics. 7:489.
[7] Gollub J, Ball CA, et al. (2003). The Stanford Microarray Database: data access and quality assessment tools. Nucleic Acids Res. 31(1):94-6.
[8] Cheung KH, White K, et al. (2002). YMD: a microarray database for large-scale gene expression analysis. Proc AMIA Symp. 2002:140-4.
[9] Manduchi E, Grant GR, et al. (2004). RAD and the RAD Study-Annotator: an approach to collection, organization and exchange of all relevant information for high-throughput gene expression studies. Bioinformatics. 20(4):452-9.
[10] Parkinson H, Sarkans U, et al. (2005). ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 33(Database issue):D553-5.
[11] Berners-Lee T, Hendler J, Lassila O. (2001). The Semantic Web. Scientific American. 284(5):34-43.
[12] Wang X, Gorlitsky R, Almeida JS. (2005). From XML to RDF: how semantic web technologies will change the design of 'omic' standards. Nat Biotechnol. 23(9):1099-103.
[13] Dunckley T, Beach TG, et al. (2006). Gene expression correlates of neurofibrillary tangles in Alzheimer's disease. Neurobiol Aging. 27:1359-71.
[14] Liang WS, Dunckley T, et al. (2007). Gene expression profiles in anatomically and functionally distinct regions of the normal aged human brain. Physiol Genomics. 28:311-22.
[15] Liang WS, Reiman EM, et al. (2008). Alzheimer's disease is associated with reduced expression of energy metabolism genes in posterior cingulate neurons. Proc Natl Acad Sci U S A. 105:4441-6.
[16] Alexander K, Cyganiak R, Hausenblas M, and Zhao J. Describing linked datasets. In Linked Data on the Web Workshop in the International World Wide Web Conference, Madrid, Spain, 2009.
[17] Hartig O, Zhao J. Publishing and consuming provenance metadata on the web of linked data. In Proceedings of The third International Provenance and Annotation Workshop, Troy, NY, U.S.A, 2010. In press.
[18] Kotecha N, Bruck K, Lu W, Shah N. (2008). Pathway knowledge base: An integrated pathway resource using BioPAX. Applied Ontology. 3(4):235-245.
[19] Missier P, Sahoo S, Zhao J, Goble C and Sheth A. Janus: Semantic Provenance Infrastructure for Taverna. In Proceedings of The third International Provenance and Annotation Workshop, Troy, NY, U.S.A, 2010. In press.
[20] Altintas I, Anand M, et al. Understanding Collaborative Studies Through Interoperable Workflow Provenance. In Proceedings of The third International Provenance and Annotation Workshop, Troy, NY, U.S.A, 2010. In press.
[21] McCusker J. and McGuinness D. Explorations into the Provenance of High Throughput Biomedical Experiments. In Proceedings of The third International Provenance and Annotation Workshop, Troy, NY, U.S.A, 2010. In press.