Provenance of Microarray Experiments for a
         Better Understanding of Experiment Results
                          Helena F. Deus
   Department of Bioinformatics and Computational Biology                                        Eric Prud’hommeaux
    The University of Texas M. D. Anderson Cancer Center                                      World Wide Web Consortium
                        Houston, USA                                                                     MIT
      Instituto de Tecnologia Química e Biológica, UNL                                             Cambridge, USA
                       Lisboa, Portugal
                                                                                                   Michael Miller
                               Jun Zhao                                                             Tantric Designs
                       Department of Zoology                                                         Seattle, USA
                        University of Oxford
                            Oxford, UK                                                           * M. Scott Marshall
                                                                                   Department of Medical Statistics and Bioinformatics
                           Satya Sahoo                                                     Leiden University Medical Center
                       Kno.e.sis Center                                                        Leiden, The Netherlands
        Department of Computer Science and Engineering                                           Informatics Institute
                    Wright State University                                                    University of Amsterdam
                        Dayton, USA                                                          Amsterdam, The Netherlands

                       Mathias Samwald                                                           * Kei-Hoi Cheung
               Digital Enterprise Research Institute                                         Center for Medical Informatics
              National University of Ireland Galway                                        Yale University School of Medicine
                         Galway, Ireland                                                           New Haven, USA

Abstract—This paper describes a Semantic Web (SW) model for                   According to [1], the workflow of a microarray experiment is
gene lists and the metadata required for their practical                      divided into the following steps: i) experimental design that
interpretation. Our provenance information captures the context               includes the type of biological questions the experiment is
of experiments as well as the processing and analysis parameters              designed to address, how the experiment is implemented (e.g.,
involved in deriving the gene lists from DNA microarray
                                                                              experiment and control), sample preparation, microarray
experiments. We demonstrate a range of practical neuroscience
queries which draw on the proposed model. Our provenance                      platform selection, hybridization process, and scanning; ii)
representation includes the origins of the gene list and basic                data extraction, which includes image quantification,
information about the data set itself (e.g. last modification date            filtering, and normalization; and iii) data analysis and
and original data source), in order to facilitate the federation of           modeling, which include approaches such as clustering, t-
gene lists with other types of Semantic Web-formatted data and                tests, enrichment analysis and so on.
include the integration of a broader molecular context through                    The gene lists produced in step iii are usually reported as
additional omics data.                                                        part of the experimental results published in scientific papers,
                                                                              and the steps involved in obtaining the gene lists are described
    Keywords-data integration, query federation, semantic web
                                                                              in the methods section. Sometimes, gene lists are made
                                                                              electronically available (e.g., spreadsheets) through journal
                          I.     INTRODUCTION                                 web sites. However, to the best of our knowledge, there is no
   In the genomics/post-genomics era, massive amounts of                      standard format for uniformly representing and broadly
data generated by high throughput experiments, including                      sharing such gene lists in a focused scientific context.
those using microarray technologies, have presented both                          We believe it would be useful to the community if such
promises and challenges to clinical and translational research.               gene lists were commonly represented in a standard SW
One goal of microarray experiments is to discover, out of tens                vocabulary and accessible to SW applications. This approach
of thousands of genes, a small subset of genes (usually on the                makes it possible for researchers to work with the gene list
order of hundreds) whose expression pattern is indicative of                  without requiring a post hoc significance analysis to re-derive
some biological response to a given experimental condition.                   the list. If experimental factors are included with gene lists,
   Many computational/statistical approaches have been                        researchers can account for context without requiring labor-
developed to detect such biologically significant gene lists.                 intensive manual research into the experimental factors for

*These authors contributed equally to this work. KC is supported in part by
NIH grant U24 NS051869. JZ is supported by EPSRC grant
EP/G049327/1. HFD is supported by the Portuguese FCT (Fundação para a
Ciência e Tecnologia) scholarship SFRH/BD/45963/2008. The work of MS
was funded by the Science Foundation Ireland under Grant No.
SFI/08/CE/I1380 (Lion-2) and by a postdoctoral fellowship from the Konrad
Lorenz Institute for Evolution and Cognition Research, Austria.
each microarray study. A standard representation can be used              The Semantic Web [11] has been actively explored in the
both for gene lists reported in individual papers (note that          context of biomedicine. For example, the W3C Semantic Web
these published gene lists are not yet stored in most microarray      Health Care and Life Sciences Interest Group (HCLS IG)
databases) and those computed from datasets collected from            (http://www.w3.org/2001/sw/hcls/) represents a major
multiple microarray experiments across different microarray           community effort involving both academia and industry. The
databases (e.g., GEO profiles [2] and Gene Expression Atlas           HCLS IG and allied efforts provide a growing corpus of
[3]).                                                                 biomedical datasets expressed in the Resource Description
    Integrated analysis (meta-analysis) requires raw and              Framework (RDF) and Web Ontology Language (OWL). Wang et
processed datasets from independent microarray experiments            al. [12] have described how the transition from the eXtensible
to be selected, compared, combined, and correlated using a            Markup Language (XML) to RDF could potentially enhance
variety of computational/statistical methods. This is, of course,     semantic representation and integration of omic data. In
much easier with machine-readable provenance and                      addition to data, biomedical ontologies are made available to
experimental context. To this end, MIAME [4] was proposed             the community through organizations such as NCBO
by the Microarray Gene Expression Data (MGED                          (http://www.bioontology.org/) and OBO Foundry
(http://www.mged.org)) community (now called “Functional              (http://www.obofoundry.org/).
Genomics Data Society” or FGED) to describe the Minimum
Information About a Microarray Experiment (MIAME) that is                 In this paper we explore using SW to represent microarray
needed to enable the interpretation of the results of the             experimental data and provenance information about the context
experiment unambiguously and potentially to reproduce the             under which the data were generated, including the goal of the
experiment. MIAME represents a set of guidelines for                  experiment, experimental factors (such as the disease or the cell
microarray databases and data management software. The                region), and the statistical analysis process which leads to the
MAGE data model and MAGE-ML (a standard XML format                    experiment results. We explore the role of provenance
for serializing the MAGE model) [5] have been developed               information in helping biologists understand microarray
based on the MIAME data content specifications. In addition,          experiments in the context of other experiments as well as other
MAGE-TAB [6] was proposed as a (more user-friendly)                   existing biomedical knowledge. To facilitate a quality-aware
alternative to MAGE-ML.                                               federation of microarray experiment results, we also provide
    Along with the development of these standards, a                  provenance information about the gene lists data published
significant number of microarray databases ranging from               using SW standards. As a pilot study, we take a bottom-up
individual labs (e.g., Nomad at deRisi lab (http://ucsf-              approach focusing on the type of provenance information
nomad.sourceforge.net/)), institutions (e.g., SMD [7], YMD            required to meet our motivating use cases and creating a
[8], and RAD [9]) to the scientific community (e.g., GEO [2]          representation model with the minimum set of terms to meet
and ArrayExpress [10]) have been created, making large                these use cases. Although these terms are currently defined in
collections of microarray datasets accessible to the public.          our own namespaces, they can largely be mapped to existing
There are also microarray databases that serve the needs of           provenance vocabularies, which are generically defined and
specific biomedical domains (e.g., the NIH Neuroscience               evolving, to achieve maximum interoperability, in the next
Microarray Consortium (http://np2.ctrl.ucla.edu/np2/                  stage of our pilot study.
home.do)). Major journal publishers have promoted sharing of
microarray data by requiring authors to submit their data to                                  II.   MOTIVATION
public microarray repositories. Some journal publishers make               One motivation of microarray experiments is to identify
supplemental data available on their web sites.                        genes that are differentially expressed in biological samples
    While many microarray databases are MIAME-compliant,               under different conditions (e.g., disease vs. control). The
several challenges still remain for researchers wishing to             samples may come from tissues extracted from different
locate datasets relevant to their interest:                            organs or parts of the same organ (e.g., different brain
                                                                       regions). In this case, we may be able to discover differentially
   • There is no central repository for all microarray datasets,       expressed genes in each organ/organ part and how disease
   and experiments/datasets are stored across multiple databases.      expressed genes in each organ/organ part and how disease
   • Users must learn to use different search interfaces and           A common outcome of experiments is a list of candidate genes
   analytic facilities at each database.                               which may serve as diagnostic or therapeutic markers. These
   • Many databases lack experimental context, annotation,             gene lists, abundant in biomedical literature, are provided in
   and provenance.                                                     heterogeneous formats (e.g., Excel spreadsheets and printed
   • Many microarray databases do not use standard                     tables embedded in papers) that hinder the reuse of the results.
   vocabularies.                                                       In order to reuse such gene lists in additional pathway or
   • The lists of differentially expressed genes discussed by          molecular analysis, it is important that they are represented in
   most articles associated with a microarray study are not            a standardized, distributable, and machine-readable format that
   disclosed in any standard format, nor are they                      is amenable to semantic queries.
   programmatically accessible.                                            After obtaining a representative list of differentially
                                                                       expressed genes, scientists may need to study these
                                                                       experiment results in a broader molecular context with
additional data. In the case of neurological disease studies          • Q4: What other diseases may be associated with the same
such as Alzheimer's Disease (AD), researchers may want to             genes found to be linked to AD?
combine gene expression data from multiple AD microarray              • Q5: What drugs are known that affect the same
studies. For example, one characterization of AD is the               overexpressed gene products and what are their target
formation of intracellular neurofibrillary tangles that affect        diseases?
neurons in brain regions involved in the memory function. It is       • Q6: Select all the genes determined to be differentially
important to have meta-data such as the cell type(s), cell            expressed in the Entorhinal cortex in experiments performed
histopathology, and brain region(s) for comparing/integrating         by AD investigators at the Translational Genomics Research
the results across different AD microarray experiments. It is         Institute.
important also to consider the (raw) data source and the types         For these types of questions, the microarray experiment
of analysis performed on the data to arrive at meaningful          results need to be federated (Q4, Q5) or combined (Q6) with
interpretations. Finally, gene expression data may be              other datasets describing the data itself. We show how the
combined with other types of data including genomic                structured representation of microarray experiment data and
functions, pathways, and associated diseases to broaden the        associated provenance metadata will enable us to query across
spectrum of integrative data analysis.                             different aspects of domain knowledge about these experiment
   In our pilot study, we selected three microarray                results using several other datasets in the HCLS KB. We also
experiments from different journals ([13-15]) to explore how       show how we can provide additional provenance information
to represent gene list experiment results in a structured format   about different datasets to support some quality-aware
and what types of metadata can better enable the computer to       federation queries over distributed data sources.
search for genes that may play a molecular role in the
pathogenesis of AD. All the gene lists from the selected                                  III.   METHODS
publications were derived from human brain samples that were          To address questions Q0-Q3 we need both a precise
prepared for AD studies. We wanted to be able to answer a          representation of the gene lists reported in the three selected
variety of user questions regarding semantically related           publications and a representation of the provenance of these
experiments and their experimental results. For example:           gene lists, such as the methods and procedures involved in
    • Q0: What microarray experiments analyze samples taken        their generation. As mentioned in Section I, several standards
    from the Entorhinal cortex region of Alzheimer's patients?     exist for describing microarray experiment protocols;
    • Q1: Was the same data normalization algorithm or             however, none is comprehensive enough to fully capture the
    statistical software package used in both studies that         complex process of reporting the results of a microarray
    analyze gene expression in the entorhinal cortex region of     experiment. To answer questions Q4-Q5 we need to query
    AD patients?                                                   across the exemplar datasets, using provenance information of
    • Q2: What genes are overexpressed in the Entorhinal           different levels of granularity, from the basic information
    cortex region in the context of Alzheimer's and what is        about the context of each experiment to details about the
    their expression fold change and associated p-value?           analysis processes generating the gene expression results.
    • Q3: Are there any genes that are expressed differently in    Although a number of provenance vocabularies, such as the
    two different brain regions (such as in Hippocampus and        open provenance model (OPM, http://openprovenance.org/)
    Entorhinal cortex)?                                            and Provenir (http://wiki.knoesis.org/index.php/
   The MIAME standard outlines the minimum set of                  Provenir_Ontology) are available, we chose a bottom-up
information that is needed for describing microarray               approach in this pilot study. On the one hand, at the time of
experiments in order to facilitate the reproduction of these       writing, little was known about how to choose between
experiments and a uniform interpretation of experiment             these existing vocabularies to best suit our purpose; on the
results. Experiments recording and publishing MIAME-               other hand, our pilot study aims to focus on capturing the
compliant experimental protocols should contain sufficient         minimum information to answer our case study questions.
information to answer questions like Q0 and Q1. However,           This approach has the added advantage of shielding our model
because MIAME does not specify a format, and MAGE-ML               from having to keep pace with rapidly evolving ontologies
and MAGE-TAB do not specify a standard representation for          while still enabling mapping to upper level ontologies in the
experiment results (such as the set of genes showing particular    future. For these reasons, our data model includes the
expression patterns), there is no simple mechanism to find         minimum set of terms necessary to describe the three
semantically related experimental results based on the patterns    examples selected, and is made available under our own local
of differentially expressed genes.                                 namespace:
   In order to answer questions Q2 and Q3, it is necessary to
model both experimental information (e.g., Entorhinal cortex)          @prefix biordf: <http://purl.org/net/biordfmicroarray/ns#> .
and statistical data (e.g. the p-values associated with gene
expression values).                                                   Compared with provenance vocabularies, many domain
   Additionally, we want to be able to extend the knowledge        specific ontologies are much more established and stable, such
about genes linked to AD such that scientists can access and       as NIF (http://www.neuinfo.org/), Disease Ontology (DO,
extend their understanding of their gene expression data           http://do-wiki.nubic.northwestern.edu/index.php/Main_Page),
analysis results to answer questions like the following:           or the voiD vocabulary [16]. Therefore, we reuse terms from
these ontologies that are already widely used to annotate            directly executed or copied and performed locally using
(biological) datasets in our data model in order to enable           software such as SWObjects (https://sourceforge.net/
maximum interoperability with other approaches.                      projects/swobjects/files/). The demo site also includes a
                                                                     diagram explaining the four provenance levels and the types of
A. The Data Model
                                                                     data entailed in each level.
   Our data model captures the minimum information                       To answer Q0, experiments performed on samples collected
necessary to describe the gene lists and the microarray              from patients with Alzheimer’s disease in a specific area of the
experiment context in which they were generated. To answer           brain, the Entorhinal cortex, must be selected from the RDF
each of the individual case study questions, different aspects       representation. The data necessary to answer this question is
of each dataset had to be considered. For example, to answer         completely entailed in the experimental provenance level and
questions like Q0 and Q3 a good overview of each microarray          can be formulated in terms of the entities used to represent
experiment is necessary, including the samples used, the             each step of the workflow involved in collecting a Sample.
disease of interest, microarray platform, etc. For questions like    Making use of data from the statistical analysis provenance
Q1 and Q2, however, a different set of assertions concerned          level, the same query Q0 can be amended to filter the list of
specifically with comparing gene expression quantification           experiments retrieved based on the statistical normalization
methods in different settings is required. Finally, the ability to   software thus enabling an answer to Q1. To answer questions
answer questions like Q4 and Q5 involves the more complex            Q2 and Q3, data pertaining to the experiment provenance level
component of performing simultaneous queries on more than            must also be combined with information about the gene lists,
one data source. As such, information describing the metadata        such as the expression level for each gene. A common
associated with each data source is also necessary. To               requirement to measure statistical significance of differentially
accommodate these different data types in our model, we have         expressed genes is the p-value that is associated with gene
defined four provenance levels, with each level entailing            expression fold change. In Q2, this information is used to trim
different subsets of information:                                    the list of over-expressed genes to those with a fold change
Institutional level: Includes assertions about the laboratory        > 0 and a p-value < 0.001.
where the experiments were performed and the reference                   One of the most significant advantages of representing gene
where the results were published to help determine the               lists in RDF is that scientists can enrich them with data from
trustworthiness of the data. This information is useful to           linked datasets such that questions like Q4 and Q5 may be
constrain the list of significant genes to only those that are       answered. The dataset description provenance level enables
published in peer-reviewed articles and/or were performed at         the discovery of useful datasets for specific purposes, such as
certain institutions that have a track record of generating          using the HCLS KB to discover diseases that may be
high quality microarray data published in respected journals.        associated with specific genes. Q4, detailed below, achieves
Experiment protocol level: Includes assertions about the             that goal by first retrieving the same list of genes as in Q2 and,
brain regions from which the samples were gathered and the           secondly, by selecting the most recently updated SPARQL
histology of the cells. Such information has been partially          service which includes assertions about both genes and
mapped to MGED, DO and NIF terms.                                    diseases. The final section queries this service to retrieve the
Data analysis and significance level: Includes assertions            correlated diseases.
about the statistical analysis methodology for selecting the
relevant genes. Terms defined for this level are also provided            SELECT DISTINCT ?diseaseName ?geneLabel ?geneName WHERE {
as a separate statistic module (http://purl.org/net/                      #Retrieve a list of overexpressed genes in the entorhinal cortex of AD patients
biordfmicroarray/stat#) to describe software tools and
                                                                          {
statistical terms.
                                                                            ?experimentSet dct:isPartOf ?microarray_experiment ;
Dataset description level: Includes assertions about when the
                                                                                            biordf:has_input_value ?sampleList ;
dataset is published, based on which version of a source
                                                                                            biordf:differentially_expressed_gene ?gene ;
dataset, and who published the dataset. Some existing
                                                                                            biordf:has_ouput_value ?foldChange .
vocabularies for describing RDF datasets on the Web were
                                                                            ?sampleList biordf:derives_from_region ?brainRegion ;
reused to enhance their trustworthiness, such as the Vocabulary                         biordf:patients_have_disease ?alzheimers .
of Interlinked Datasets (voiD) [16], which provides basic                   ?gene rdfs:label ?geneLabel ;
information about who published the data as well as a summary                      biordf:name ?geneName .
of the content of the dataset, such as the number of genes                  ?foldChange rdf:value ?foldChangeValue ;
described by the dataset or the SPARQL endpoint through                                  stat:p_value ?pval .
which the dataset can be accessed. The Provenance Vocabulary                #Apply filters to constrain the amount of results
[17] was also used to provide a richer set of provenance                      FILTER (xsd:float(?foldChangeValue) > 0)
information, such as when the dataset is published, using which               FILTER (xsd:float(?pval) < 0.001 )
tool, or by accessing which data server.                                      FILTER (?brainRegion = neurolex:Entorhinal_cortex )
                                                                              FILTER (?alzheimers = doid:DOID_10652 )
B. Formulation of SPARQL queries                                          }
   The queries described here are formulated at our demo site             #Find most recently updated SPARQL endpoint that contains information about genes and diseases.
(http://purl.org/net/biordfmicroarray/demo), where they can be
                                                                          {
       ?source rdf:type void:Dataset ;                                     demonstrated by query Q4. In the case of Linked Open Data,
       void:sparqlEndpoint ?srvc ;                                         the set of best practices for exposing data as RDF through a
       dct:issued ?issued ;                                                SPARQL endpoint, researchers often need to distinguish
       dct:subject diseasome:diseases ;
                                                                           between multiple RDF renderings (i.e. representations) of the
       dct:subject diseasome:genes .
                                                                           same data set or different versions of it. Different endpoints
   OPTIONAL {
       ?source1 rdf:type void:Dataset ;
                                                                           can be discovered by issuing queries that target the data
       void:sparqlEndpoint ?srvc2 ;                                        sources themselves: When was the last RDF rendering created
       dct:issued ?issued2 ;                                               and by whom (or which project)? Which
       dct:subject diseasome:diseases ;                                    ontologies/vocabularies were used? The same standardized
       dct:subject diseasome:genes .                                       SW mechanisms of reasoning and pattern matching that are
       FILTER (?issued2 > ?issued)                                         used to discover related facts across data sources can also
   }                                                                       be applied to select a specific data source.
   FILTER (!BOUND(?srvc2))                                                    The provenance data model developed for reporting
   }                                                                       microarray experiment results while capturing different types
   #Get associated diseases from most recently updated Diseasome server.   of provenance information was motivated by our user-defined
     SERVICE ?srvc {                                                       queries. We have therefore applied a bottom-up approach that
       ?diseasomeGene rdfs:label ?geneLabel .                              focused on describing the data first before mapping it to
       ?disease diseasome:associatedGene ?diseasomeGene.
                                                                           widely used ontologies. Although several provenance
       ?disease rdfs:label ?diseaseName .
                                                                           ontologies are available, some of them are upper level
   }
   }
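
The dataset-selection block of the query above (choosing the most recently issued voiD dataset that covers both genes and diseases) can also be written with the SPARQL 1.1 FILTER NOT EXISTS idiom instead of OPTIONAL combined with !BOUND. The following is a minimal sketch of that alternative, not the query deployed at the demo site. The rdf, void, and dct prefixes use their standard namespace IRIs; the diseasome namespace IRI is an assumption, because the text does not state it.

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dct:  <http://purl.org/dc/terms/>
# ASSUMPTION: namespace IRI for the diseasome prefix; replace with the one used by the demo site.
PREFIX diseasome: <http://www4.wiwiss.fu-berlin.de/diseasome/resource/diseasome/>

SELECT ?srvc ?issued WHERE {
  # Candidate dataset: describes both diseases and genes and exposes a SPARQL endpoint.
  ?source rdf:type void:Dataset ;
          void:sparqlEndpoint ?srvc ;
          dct:issued ?issued ;
          dct:subject diseasome:diseases , diseasome:genes .
  # Keep the candidate only if no other matching dataset was issued later.
  FILTER NOT EXISTS {
    ?other rdf:type void:Dataset ;
           void:sparqlEndpoint ?otherSrvc ;
           dct:issued ?laterIssued ;
           dct:subject diseasome:diseases , diseasome:genes .
    FILTER (?laterIssued > ?issued)
  }
}
```

In either formulation, the endpoint that survives the filter is the one bound to ?srvc, so that is the variable a subsequent SERVICE clause would need to reference.
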
                                                                           ontologies, such as Provenir, therefore lacking the specific
                                                                           terms required for describing how gene lists were derived.
    Finally, to answer Q6, data from the institutional
                                                                           Other ontologies, such as the Provenance Vocabulary for
provenance level are used to limit the list of retrieved
experiments to those that were performed at a specific                     Linked Data and the Proof Markup Language, were created for
institution. The queries presented here are executable through             specific application domains, such as explaining reasoning
our demo at http://purl.org/net/biordfmicroarray/demo. Their               results. Our bottom-up approach enabled us to identify and
time to execution ranges between 100 and 200 ms for local                  define the minimum set of provenance terms to answer a set of
queries (Q1-Q3, Q6) and a few seconds (2-5s) for federated                 queries from different perspectives and shield the data model
queries (Q4-Q5) executed using SWObjects.                                  from depending on external vocabularies which are often
                                                                           subject to changes. For increased interoperability, mapping
C. Availability                                                            terms from our model to terms from a community provenance
                                                                           model, such as the OPM or others is straightforward. For
    The RDF representation was generated using JavaScript
                                                                           example, our property biordf:has_input_value can be made a
and the data was loaded into a public SPARQL endpoint
                                                                           sub-property of the inverse of OPM property used, and
(http://purl.org/net/biordfmicroarray/sparql). We elaborate and
                                                                           biordf:derives_from_region can become a sub-property of
further expand the provenance queries in this paper at our
                                                                           OPM property wasDerivedFrom.
demo site http://purl.org/net/biordfmicroarray/demo. A figure
                                                                              Further down the pipeline of microarray studies,
associating each of the four provenance levels with the data
                                                                           bioinformaticians will often need to combine knowledge about
that they are concerned with is also made available at the
                                                                           the genes derived from their microarray experiments in order
demo site. The complete RDF/turtle representation can be
                                                                           to achieve a deeper understanding at a systems biology level.
downloaded from http://biordfmicroarray.googlecode.com/
                                                                           Although the number of genes that has to be taken into
files/all3_genelists_provenance.ttl. The JavaScript code to
                                                                           consideration while studying Alzheimer’s has been
convert Excel spreadsheets into RDF is available at
                                                                           significantly reduced by many gene expression studies, a good
http://code.google.com/p/biordfmicroarray/ .
                                                                           number of genes (ranging from tens to hundreds) are yet to be
                                                                           processed. One approach becoming increasingly popular is the
                                                                           use of scientific workflow workbenches (such as Taverna and
                          IV.     DISCUSSION                               Kepler) to perform large scale data analysis. Many such
    A data model to explicitly make the content and context of             workbenches [19-20] also record the workflow provenance
gene lists (e.g., differentially expressed genes) available in             information about, for example, what genes from which
RDF format was developed. In the process, four types of                    organism were processed and how the proteins encoded by the
provenance were identified that were found necessary to                    genes were discovered by querying various genomic
characterize, discover, reproduce, compare and integrate gene              databases. Combining this workflow provenance information
lists with other data. Expressing provenance in RDF enables                and the set of microarray experiment-related provenance
describing the data itself (i.e. its origin, version and URL               information by mapping both to a common community
location) in the same language as the elements represented                 provenance model, such as OPM, the trustworthiness and
therein. The power of this uniform access to data and metadata             reproducibility of experiment results would be increased
should not be underestimated. In practice, this means that                 throughout the whole experiment life cycle. McCusker et al.
SPARQL queries can express constraints both about the                      [21] has taken a first step towards by providing a tentative
origins of the data and contents (or attributes) of the data as            translation from MGED-TAB to the OPM.
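
The mapping to OPM suggested in this discussion (biordf:has_input_value as a sub-property of the inverse of used, and biordf:derives_from_region as a sub-property of wasDerivedFrom) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `opm:` prefix, the `ex:` resource names and the function `entail_opm` are assumptions made only for illustration, and triples are kept as plain tuples of prefixed names rather than real RDF terms. In practice the two mappings would more likely be stated in RDFS/OWL and applied by a reasoner or a SPARQL CONSTRUCT query.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]

# Mapping rules taken from the text. The boolean marks a mapping to the
# INVERSE of the OPM property, so subject and object swap places.
# The "opm:" prefix is a placeholder, not a verified namespace.
MAPPINGS = {
    "biordf:has_input_value": ("opm:used", True),
    "biordf:derives_from_region": ("opm:wasDerivedFrom", False),
}


def entail_opm(triples: Iterable[Triple]) -> Set[Triple]:
    """Return the OPM-level triples entailed by the sub-property mappings."""
    entailed: Set[Triple] = set()
    for subject, predicate, obj in triples:
        if predicate not in MAPPINGS:
            continue  # triples without a mapping entail nothing here
        opm_property, inverted = MAPPINGS[predicate]
        if inverted:
            entailed.add((obj, opm_property, subject))
        else:
            entailed.add((subject, opm_property, obj))
    return entailed


if __name__ == "__main__":
    # Hypothetical resources; the names carry no meaning beyond the example.
    data = [
        ("ex:resource1", "biordf:has_input_value", "ex:value1"),
        ("ex:resource2", "biordf:derives_from_region", "ex:region1"),
        ("ex:resource1", "rdfs:label", "unmapped triple"),
    ]
    for triple in sorted(entail_opm(data)):
        print(triple)
```

Keeping the original biordf triples and only adding the entailed OPM triples matches the intent stated above: the data model stays independent of the external vocabulary, while OPM-aware tools can still query the derived statements.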
    While we endorse the use of SW technologies as the standard machine-readable format, we acknowledge that most biologists are not familiar with SW and prefer to use formats such as Excel spreadsheets to work with gene list results. To this end, it would be useful to use a standardized user-friendly format (e.g., MAGE-TAB) for encoding gene lists and their context that could be easily converted into the SW format.

                           V.     CONCLUSION

    We describe and illustrate with a case study the beneficial role of Semantic Web technologies in 'omic' data representation by providing and querying a data model to capture provenance information related to reporting microarray experiment results. We have tackled not only the engineering aspect of the data integration problem, but also the more fundamental issues of federating data that begin with seemingly homogeneous data sources (microarray databases) and extend to heterogeneous data domains at multiple levels. This is also driven by the growing collaboration between a wide spectrum of scientific disciplines and communities, such as is required for translational research. We have used a bottom-up approach that facilitated the identification of four provenance levels necessary to report microarray experiment results and shielded our data model from becoming dependent on constantly evolving ontologies. We have, however, discussed how some of the terms and relationships from existing provenance ontologies can be mapped to our model. Some issues found to be necessary in the integration of microarray data sources could also be considered relevant for the federation of data sources in general. As more 'omics' data are generated, the complexity and requirements for discovery-based research increase. As a result, there is a growing demand for effective data provenance and integration at many levels that counts on the active involvement of scientists and informaticians. Our work represents a step in this direction.

                         ACKNOWLEDGMENT

    The authors would like to express their sincere gratitude to the HCLS IG for helping to organize and coordinate with different task forces, including the BioRDF task force within which the neuroscience microarray use case was explored. We also thank Helen Parkinson, James Malone, Misha Kapushesky, Jonas Almeida and three anonymous reviewers. MSM appreciated the support of Jelle Goeman (LUMC) during this work.

                              REFERENCES

[1] Stears RL, Martinsky T, Schena M. (2003). Trends in microarray analysis. Nature Medicine. (9):140-145.
[2] Barrett T, Troup DB, et al. (2009). NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 2009 Jan;37(Database issue):D885-90.
[3] Lukk M, Kapushesky M, et al. A global map of human gene expression. Nat Biotechnology 28, 322-324 (2010).
[4] Brazma A, Hingamp P, et al. (2001). Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet. 29(4):365-71.
[5] Spellman PT, Miller M, et al. (2002). Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol. 3(9):RESEARCH0046.
[6] Rayner TF, Rocca-Serra P, et al. (2006). A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics. 7:489.
[7] Gollub J, Ball CA, et al. (2003). The Stanford Microarray Database: data access and quality assessment tools. Nucleic Acids Res. 31(1):94-6.
[8] Cheung KH, White K, et al. (2002). YMD: a microarray database for large-scale gene expression analysis. Proc AMIA Symp. 2002:140-4.
[9] Manduchi E, Grant GR, et al. (2004). RAD and the RAD Study-Annotator: an approach to collection, organization and exchange of all relevant information for high-throughput gene expression studies. Bioinformatics. 20(4):452-9.
[10] Parkinson H, Sarkans U, et al. (2005). ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 33(Database issue):D553-5.
[11] Berners-Lee T, Hendler J, Lassila O. (2001). The Semantic Web. Scientific American. 284(5):34-43.
[12] Wang X, Gorlitsky R, Almeida JS. (2005). From XML to RDF: how semantic web technologies will change the design of 'omic' standards. Nat Biotechnol. 23(9):1099-103.
[13] Dunckley T, Beach TG, et al. (2006). Gene expression correlates of neurofibrillary tangles in Alzheimer's disease. Neurobiol Aging. 27:1359-71.
[14] Liang WS, Dunckley T, et al. (2007). Gene expression profiles in anatomically and functionally distinct regions of the normal aged human brain. Physiol Genomics. 28:311-22.
[15] Liang WS, Reiman EM, et al. (2008). Alzheimer's disease is associated with reduced expression of energy metabolism genes in posterior cingulate neurons. Proc Natl Acad Sci U S A. 105:4441-6.
[16] Alexander K, Cyganiak R, Hausenblas M, and Zhao J. Describing linked datasets. In Linked Data on the Web Workshop in the International World Wide Web Conference, Madrid, Spain, 2009.
[17] Hartig O, Zhao J. Publishing and consuming provenance metadata on the web of linked data. In Proceedings of The third International Provenance and Annotation Workshop, Troy, NY, U.S.A, 2010. In press.
[18] Kotecha N, Bruck K, Lu W, Shah N. Pathway knowledge base: An integrated pathway resource using BioPAX. Applied Ontology. 3(4):235-245. 2008.
[19] Missier P, Sahoo S, Zhao J, Goble C and Sheth A. Janus: Semantic Provenance Infrastructure for Taverna. In Proceedings of The third International Provenance and Annotation Workshop, Troy, NY, U.S.A, 2010. In press.
[20] Altintas I, Anand M, et al. Understanding Collaborative Studies Through Interoperable Workflow Provenance. In Proceedings of The third International Provenance and Annotation Workshop, Troy, NY, U.S.A, 2010. In press.
[21] McCusker J. and McGuinness D. Explorations into the Provenance of High Throughput Biomedical Experiments. In Proceedings of The third International Provenance and Annotation Workshop, Troy, NY, U.S.A, 2010. In press.