Provenance of Microarray Experiments for a Better Understanding of Experiment Results

Helena F. Deus
Department of Bioinformatics and Computational Biology, The University of Texas M. D. Anderson Cancer Center, Houston, USA
Instituto de Tecnologia Química e Biológica, UNL, Lisboa, Portugal

Eric Prud’hommeaux
World Wide Web Consortium, MIT, Cambridge, USA

Michael Miller
Tantric Designs, Seattle, USA

Jun Zhao
Department of Zoology, University of Oxford, Oxford, UK

*M. Scott Marshall
Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands
Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands

Satya Sahoo
Kno.e.sis Center, Department of Computer Science and Engineering, Wright State University, Dayton, USA

Mathias Samwald
Digital Enterprise Research Institute, National University of Ireland Galway, Galway, Ireland

*Kei-Hoi Cheung
Center for Medical Informatics, Yale University School of Medicine, New Haven, USA

*These authors contributed equally to this work. KC is supported in part by NIH grant U24 NS051869. JZ is supported by EPSRC grant EP/G049327/1. HFD is supported by the Portuguese FCT (Fundação para a Ciência e Tecnologia) scholarship SFRH/BD/45963/2008. The work of MS was funded by the Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by a postdoctoral fellowship from the Konrad Lorenz Institute for Evolution and Cognition Research, Austria.

Abstract: This paper describes a Semantic Web (SW) model for gene lists and the metadata required for their practical interpretation. Our provenance information captures the context of experiments as well as the processing and analysis parameters involved in deriving the gene lists from DNA microarray experiments. We demonstrate a range of practical neuroscience queries which draw on the proposed model. Our provenance representation includes the origins of the gene list and basic information about the data set itself (e.g. last modification date and original data source), in order to facilitate the federation of gene lists with other types of Semantic Web-formatted data and to include a broader molecular context through additional omics data.

Keywords: data integration, query federation, semantic web

I. INTRODUCTION

In the genomics/post-genomics era, massive amounts of data generated by high throughput experiments, including those using microarray technologies, have presented both promises and challenges to clinical and translational research. One goal of microarray experiments is to discover, out of tens of thousands of genes, a small subset of genes (usually on the order of hundreds) whose expression pattern is indicative of some biological response to a given experimental condition. Many computational/statistical approaches have been developed to detect such biologically significant gene lists.

According to [1], the workflow of a microarray experiment is divided into the following steps: i) experimental design, which includes the type of biological questions the experiment is designed to address, how the experiment is implemented (e.g., experiment and control), sample preparation, microarray platform selection, hybridization process, and scanning; ii) data extraction, which includes image quantification, filtering, and normalization; and iii) data analysis and modeling, which include approaches such as clustering, t-tests, enrichment analysis and so on.

The gene lists produced in step iii are usually reported as part of the experimental results published in scientific papers, and the steps involved in obtaining the gene lists are described in the methods section. Sometimes, gene lists are made electronically available (e.g., as spreadsheets) through journal web sites. However, to the best of our knowledge, there is no standard format for uniformly representing and broadly sharing such gene lists in a focused scientific context.

We believe it would be useful to the community if such gene lists were commonly represented in a standard SW vocabulary and accessible to SW applications. This approach makes it possible for researchers to work with the gene list without requiring a post hoc significance analysis to re-derive the list. If experimental factors are included with gene lists, researchers can account for context without requiring labor-intensive manual research into the experimental factors for each microarray study. A standard representation can be used both for gene lists reported in individual papers (note that these published gene lists are not yet stored in most microarray databases) and those computed from datasets collected from multiple microarray experiments across different microarray databases (e.g., GEO profiles [2] and Gene Expression Atlas [3]).

Integrated analysis (meta-analysis) requires raw and processed datasets from independent microarray experiments to be selected, compared, combined, and correlated using a variety of computational/statistical methods. This is, of course, much easier with machine-readable provenance and experimental context. To this end, MIAME [4] was proposed by the Microarray Gene Expression Data (MGED, http://www.mged.org) community (now called “Functional Genomics Data Society” or FGED) to describe the Minimum Information About a Microarray Experiment (MIAME) that is needed to enable the interpretation of the results of the experiment unambiguously and potentially to reproduce the experiment. MIAME represents a set of guidelines for microarray databases and data management software. The MAGE data model and MAGE-ML (a standard XML format for serializing the MAGE model) [5] have been developed based on the MIAME data content specifications. In addition, MAGE-TAB [6] was proposed as a (more user-friendly) alternative to MAGE-ML.

Along with the development of these standards, a significant number of microarray databases have been created, ranging from individual labs (e.g., Nomad at deRisi lab (http://ucsf-nomad.sourceforge.net/)) and institutions (e.g., SMD [7], YMD [8], and RAD [9]) to the scientific community (e.g., GEO [2] and ArrayExpress [10]), making large collections of microarray datasets accessible to the public. There are also microarray databases that serve the needs of specific biomedical domains (e.g., the NIH Neuroscience Microarray Consortium (http://np2.ctrl.ucla.edu/np2/home.do)). Major journal publishers have promoted sharing of microarray data by requiring authors to submit their data to public microarray repositories. Some journal publishers make supplemental data available on their web sites.

While many microarray databases are MIAME-compliant, several challenges still remain for researchers wishing to locate datasets relevant to their interest:

• There is no central repository for all microarray datasets, and experiments and datasets are stored in multiple databases.
• Users must learn to use different search interfaces and analytic facilities at each database.
• Many databases lack experimental context, annotation, and provenance.
• There is a lack of use of standard vocabularies in many microarray databases.
• The lists of differentially expressed genes discussed by most articles associated with a microarray study are not disclosed in any standard format, nor are they programmatically accessible.

The Semantic Web [11] has been actively explored in the context of biomedicine. For example, the W3C Semantic Web Health Care and Life Sciences Interest Group (HCLS IG) (http://www.w3.org/2001/sw/hcls/) represents a major community effort involving both academia and industry. The HCLS IG and allied efforts provide a growing corpus of biomedical datasets expressed in the Resource Description Framework (RDF) and Web Ontology Language (OWL). Wang et al. [12] have described how the transition from the Extensible Markup Language (XML) to RDF could potentially enhance semantic representation and integration of omic data. In addition to data, biomedical ontologies are made available to the community through organizations such as NCBO (http://www.bioontology.org/) and OBO Foundry (http://www.obofoundry.org/).

In this paper we explore using SW to represent microarray experimental data and provenance information about the context under which the data were generated, including the goal of the experiment, experimental factors (such as the disease or the cell region), and the statistical analysis process which leads to the experiment results. We explore the role of provenance information in helping biologists understand microarray experiments in the context of other experiments as well as other existing biomedical knowledge. To facilitate a quality-aware federation of microarray experiment results, we also provide provenance information about the gene list data published using SW standards. As a pilot study, we take a bottom-up approach focusing on the type of provenance information required to meet our motivating use cases and creating a representation model with the minimum set of terms to meet these use cases. Although these terms are currently defined in our own namespaces, they can largely be mapped, in the next stage of our pilot study, to existing provenance vocabularies, which are generically defined and evolving, to achieve maximum interoperability.

II. MOTIVATION

One motivation of microarray experiments is to identify genes that are differentially expressed in biological samples under different conditions (e.g., disease vs. control). The samples may come from tissues extracted from different organs or parts of the same organ (e.g., different brain regions). In this case, we may be able to discover differentially expressed genes in each organ/organ part and how disease may affect each organ/organ part at the gene expression level. A common outcome of experiments is a list of candidate genes which may serve as diagnostic or therapeutic markers. These gene lists, abundant in biomedical literature, are provided in heterogeneous formats (e.g., Excel spreadsheets and printed tables embedded in papers) that hinder the reuse of the results. In order to reuse such gene lists in additional pathway or molecular analysis, it is important that they are represented in a standardized, distributable, and machine-readable format that is amenable to semantic queries.

After obtaining a representative list of differentially expressed genes, scientists may need to study these experiment results in a broader molecular context with additional data. In the case of neurological disease studies such as Alzheimer's Disease (AD), researchers may want to combine gene expression data from multiple AD microarray studies. For example, one characterization of AD is the formation of intracellular neurofibrillary tangles that affect neurons in brain regions involved in the memory function. It is important to have metadata such as the cell type(s), cell histopathology, and brain region(s) for comparing/integrating the results across different AD microarray experiments. It is also important to consider the (raw) data source and the types of analysis performed on the data to arrive at meaningful interpretations. Finally, gene expression data may be combined with other types of data including genomic functions, pathways, and associated diseases to broaden the spectrum of integrative data analysis.

In our pilot study, we selected three microarray experiments from different journals ([13-15]) to explore how to represent gene list experiment results in a structured format and what types of metadata can better enable the computer to search for genes that may play a molecular role in the pathogenesis of AD. All the gene lists from the selected publications were derived from human brain samples that were prepared for AD studies. We wanted to be able to answer a variety of user questions regarding semantically related experiments and their experimental results. For example:

• Q0: What microarray experiments analyze samples taken from the Entorhinal cortex region of Alzheimer's patients?
• Q1: Was the same data normalization algorithm or statistical software package used in both studies that analyze gene expression in the Entorhinal cortex region of AD patients?
• Q2: What genes are overexpressed in the Entorhinal cortex region in the context of Alzheimer's and what is their expression fold change and associated p-value?
• Q3: Are there any genes that are expressed differently in two different brain regions (such as in Hippocampus and Entorhinal cortex)?

The MIAME standard outlines the minimum set of information that is needed for describing microarray experiments in order to facilitate the reproduction of these experiments and a uniform interpretation of experiment results. Experiments that record and publish a MIAME-compliant experimental protocol should contain sufficient information to answer questions like Q0 and Q1. However, because MIAME does not specify a format, and MAGE-ML and MAGE-TAB do not specify a standard representation for experiment results (such as the set of genes showing particular expression patterns), there is no simple mechanism to find semantically related experimental results based on the patterns of differentially expressed genes.

In order to answer questions Q2 and Q3, it is necessary to model both experimental information (e.g., Entorhinal cortex) and statistical data (e.g., the p-values associated with gene expression values).

Additionally, we want to be able to extend the knowledge about genes linked to AD such that scientists can access and extend their understanding of their gene expression data analysis results to answer questions like the following:

• Q4: What other diseases may be associated with the same genes found to be linked to AD?
• Q5: What drugs are known that affect the same overexpressed gene products and what are their target diseases?
• Q6: Select all the genes determined to be differentially expressed in the Entorhinal cortex in experiments performed by AD investigators at the Translational Genomics Research Institute.

For these types of questions, the microarray experiment results need to be federated (Q4, Q5) or combined (Q6) with other datasets describing the data itself. We show how the structured representation of microarray experiment data and associated provenance metadata will enable us to query across different aspects of domain knowledge about these experiment results using several other datasets in the HCLS KB. We also show how we can provide additional provenance information about different datasets to support some quality-aware federation queries over distributed data sources.

III. METHODS

To address questions Q0-Q3 we need both a precise representation of the gene lists reported in the three selected publications and a representation of the provenance of these gene lists, such as the methods and procedures involved in their generation. As mentioned in Section I, several standards exist for describing microarray experiment protocols; however, none is comprehensive enough to fully capture the complex process of reporting the results of a microarray experiment. To answer questions Q4-Q5 we need to query across the exemplar datasets, using provenance information of different levels of granularity, from the basic information about the context of each experiment to details about the analysis processes generating the gene expression results.

Although a number of provenance vocabularies, such as the open provenance model (OPM, http://openprovenance.org/) and Provenir (http://wiki.knoesis.org/index.php/Provenir_Ontology), are available, we chose a bottom-up approach in this pilot study. On the one hand, at the time of writing, little was known about how to choose between these existing vocabularies to best suit our purpose; on the other hand, our pilot study aims to focus on capturing the minimum information to answer our case study questions. This approach has the added advantage of shielding our model from having to keep pace with rapidly evolving ontologies while still enabling mapping to upper level ontologies in the future. For these reasons, our data model includes the minimum set of terms necessary to describe the three examples selected, and is made available under our own local namespace:

    @prefix biordf: <http://purl.org/net/biordfmicroarray/ns#> .

Compared with provenance vocabularies, many domain specific ontologies are much more established and stable, such as NIF (http://www.neuinfo.org/), disease ontology (DO, http://do-wiki.nubic.northwestern.edu/index.php/Main_Page), or the voiD vocabulary [16].
Therefore, we reuse terms from these ontologies that are already widely used to annotate (biological) datasets in our data model in order to enable maximum interoperability with other approaches.

A. The Data Model

Our data model captures the minimum information necessary to describe the gene lists and the microarray experiment context in which they were generated. To answer each of the individual case study questions, different aspects of each dataset had to be considered. For example, to answer questions like Q0 and Q3 a good overview of each microarray experiment is necessary, including the samples used, the disease of interest, microarray platform, etc. For questions like Q1 and Q2, however, a different set of assertions concerned specifically with comparing gene expression quantification methods in different settings is required. Finally, the ability to answer questions like Q4 and Q5 involves the more complex component of performing simultaneous queries on more than one data source. As such, information describing the metadata associated with each data source is also necessary. To accommodate these different data types in our model, we have defined four provenance levels, with each level entailing different subsets of information:

Institutional level: Includes assertions about the laboratory where the experiments were performed and the reference where the results were published to help determine the trustworthiness of the data. This information is useful to constrain the list of significant genes to only those that are published in peer-reviewed articles and/or were obtained at certain institutions that have a track record of generating high quality microarray data published in respected journals.

Experiment protocol level: Includes assertions about the brain regions from which the samples were gathered and the histology of the cells. Such information has been partially mapped to MGED, DO and NIF terms.

Data analysis and significance level: Includes assertions about the statistical analysis methodology for selecting the relevant genes. Terms defined for this level are also provided as a separate statistics module (http://purl.org/net/biordfmicroarray/stat#) to describe software tools and statistical terms.

Dataset description level: Includes assertions about when the dataset was published, which version of a source dataset it is based on, and who published the dataset. Some existing vocabularies for describing RDF datasets on the Web were reused to enhance their trustworthiness, such as the Vocabulary of Interlinked Datasets (voiD) [16], which provides basic information about who published the data as well as a summary of the content of the dataset, such as the number of genes described by the dataset or the SPARQL endpoint through which the dataset can be accessed. The Provenance Vocabulary [17] was also used to provide a richer set of provenance information, such as when the dataset was published, using which tool, or by accessing which data server.

B. Formulation of SPARQL queries

The queries described here are formulated at our demo site (http://purl.org/net/biordfmicroarray/demo), where they can be directly executed or copied and performed locally using software such as SWObjects (https://sourceforge.net/projects/swobjects/files/). The demo site also includes a diagram explaining the four provenance levels and the types of data entailed in each level.

To answer Q0, experiments performed on samples collected from patients with Alzheimer’s disease in a specific area of the brain, the Entorhinal cortex, must be selected from the RDF representation. The data necessary to answer this question are completely entailed in the experiment protocol provenance level, and the query can be formulated in terms of the entities used to represent each step of the workflow involved in collecting a Sample. Making use of data from the data analysis and significance provenance level, the same query Q0 can be amended to filter the list of experiments retrieved based on the statistical normalization software, thus enabling an answer to Q1. To answer questions Q2 and Q3, data pertaining to the experiment protocol provenance level must also be combined with information about the gene lists, such as the expression level for each gene. A common requirement to measure statistical significance of differentially expressed genes is the p-value that is associated with gene expression fold change. In Q2, this information is used to trim the list of over-expressed genes by requiring that fold change > 0, but only in cases where the p-value is < 0.001.

One of the most significant advantages of representing gene lists in RDF is helping scientists enrich them with data from linked datasets such that questions like Q4 and Q5 may be answered. The dataset description provenance level enables the discovery of useful datasets for specific purposes, such as using the HCLS KB to discover diseases that may be associated with specific genes. Q4, detailed below, achieves that goal by first retrieving the same list of genes as in Q2 and, secondly, by selecting the most recently updated SPARQL service which includes assertions about both genes and diseases. The final section queries this service to retrieve the correlated diseases.

    SELECT DISTINCT ?diseaseName ?geneLabel ?geneName
    WHERE {
      # Retrieve a list of overexpressed genes in the entorhinal cortex of AD patients
      {
        ?experimentSet dct:isPartOf ?microarray_experiment ;
                       biordf:has_input_value ?sampleList ;
                       biordf:differentially_expressed_gene ?gene ;
                       biordf:has_ouput_value ?foldChange .
        ?sampleList biordf:derives_from_region ?brainRegion ;
                    biordf:patients_have_disease ?alzheimers .
        ?gene rdfs:label ?geneLabel ;
              biordf:name ?geneName .
        ?foldChange rdf:value ?foldChangeValue ;
                    stat:p_value ?pval .
        # Apply filters to constrain the amount of results
        FILTER (xsd:float(?foldChangeValue) > 0)
        FILTER (xsd:float(?pval) < 0.001)
        FILTER (?brainRegion = neurolex:Entorhinal_cortex)
        FILTER (?alzheimers = doid:DOID_10652)
      }
      # Find most recently updated SPARQL endpoint that contains information about genes and diseases
      {
        ?source rdf:type void:Dataset ;
                void:sparqlEndpoint ?srvc ;
                dct:issued ?issued ;
                dct:subject diseasome:diseases ;
                dct:subject diseasome:genes .
        # Look for any dataset issued later than ?source
        OPTIONAL {
          ?source1 rdf:type void:Dataset ;
                   void:sparqlEndpoint ?srvc2 ;
                   dct:issued ?issued2 ;
                   dct:subject diseasome:diseases ;
                   dct:subject diseasome:genes .
          FILTER (?issued2 > ?issued)
        }
        # Keep ?source only if no later dataset was found
        FILTER (!BOUND(?srvc2))
      }
      # Get associated diseases from most recently updated Diseasome server.
      # ?srvc2 is unbound after the filter above, so the selected endpoint is ?srvc.
      SERVICE ?srvc {
        ?diseasomeGene rdfs:label ?geneLabel .
        ?disease diseasome:associatedGene ?diseasomeGene .
        ?disease rdfs:label ?diseaseName .
      }
    }

Finally, to answer Q6, data from the institutional provenance level are used to limit the list of retrieved experiments to those that were performed at a specific institution. The queries presented here are executable through our demo at http://purl.org/net/biordfmicroarray/demo. Their execution time ranges between 100 and 200 ms for local queries (Q1-Q3, Q6) and a few seconds (2-5 s) for federated queries (Q4-Q5) executed using SWObjects.

C. Availability

The RDF representation was generated using JavaScript and the data was loaded into a public SPARQL endpoint (http://purl.org/net/biordfmicroarray/sparql). We elaborate and further expand the provenance queries in this paper at our demo site http://purl.org/net/biordfmicroarray/demo. A figure associating each of the four provenance levels with the data that they are concerned with is also made available at the demo site. The complete RDF/Turtle representation can be downloaded from http://biordfmicroarray.googlecode.com/files/all3_genelists_provenance.ttl. The JavaScript code to convert Excel spreadsheets into RDF is available at http://code.google.com/p/biordfmicroarray/.

IV. DISCUSSION

A data model to explicitly make the content and context of gene lists (e.g., differentially expressed genes) available in RDF format was developed. In the process, four types of provenance were identified that were found necessary to characterize, discover, reproduce, compare and integrate gene lists with other data. Expressing provenance in RDF enables describing the data itself (i.e. its origin, version and URL location) in the same language as the elements represented therein. The power of this uniform access to data and metadata should not be underestimated. In practice, this means that SPARQL queries can express constraints both about the origins of the data and contents (or attributes) of the data, as demonstrated by query Q4. In the case of Linked Open Data, the set of best practices for exposing data as RDF through a SPARQL endpoint, researchers often need to distinguish between multiple RDF renderings (i.e. representations) of the same data set or different versions of it. Different endpoints can be discovered by issuing queries that target the data sources themselves: When was the last RDF rendering created and by whom (or which project)? Which ontologies/vocabularies were used? The same standardized SW mechanisms of reasoning and pattern matching that are used to discover related facts across the data sources can be applied to select a specific data source.

The provenance data model developed for reporting microarray experiment results while capturing different types of provenance information was motivated by our user-defined queries. We have therefore applied a bottom-up approach that focused on describing the data first before mapping it to widely used ontologies. Although several provenance ontologies are available, some of them, such as Provenir, are upper level ontologies and therefore lack the specific terms required for describing how gene lists were derived. Other ontologies, such as the Provenance Vocabulary for Linked Data and the Proof Markup Language, were created for specific application domains, such as explaining reasoning results. Our bottom-up approach enabled us to identify and define the minimum set of provenance terms to answer a set of queries from different perspectives and to shield the data model from depending on external vocabularies, which are often subject to changes. For increased interoperability, mapping terms from our model to terms from a community provenance model, such as the OPM or others, is straightforward. For example, our property biordf:has_input_value can be made a sub-property of the inverse of OPM property used, and biordf:derives_from_region can become a sub-property of OPM property wasDerivedFrom.

Further down the pipeline of microarray studies, bioinformaticians will often need to combine knowledge about the genes derived from their microarray experiments in order to achieve a deeper understanding at a systems biology level. Although the number of genes that has to be taken into consideration while studying Alzheimer’s has been significantly reduced by many gene expression studies, a good number of genes (ranging from tens to hundreds) are yet to be processed. One approach becoming increasingly popular is the use of scientific workflow workbenches (such as Taverna and Kepler) to perform large scale data analysis. Many such workbenches [19-20] also record the workflow provenance information about, for example, what genes from which organism were processed and how the proteins encoded by the genes were discovered by querying various genomic databases. Combining this workflow provenance information and the set of microarray experiment-related provenance information, by mapping both to a common community provenance model such as OPM, would increase the trustworthiness and reproducibility of experiment results throughout the whole experiment life cycle. McCusker et al. [21] have taken a first step in this direction by providing a tentative translation from MAGE-TAB to the OPM.

While we endorse the use of SW technologies as the standard machine-readable format, we acknowledge that most biologists are not familiar with SW and prefer to use formats such as Excel spreadsheets to work with gene list results. To this end, it would be useful to use a standardized user-friendly format (e.g., MAGE-TAB) for encoding gene lists and their context that could be easily converted into the SW format.

V. CONCLUSION

We describe and illustrate with a case study the beneficial role of Semantic Web technologies in ‘omic’ data representation by providing and querying a data model to capture provenance information related to reporting microarray experiment results. We have tackled not only the engineering aspect of the data integration problem, but also the more fundamental issue of federating data, which begins with seemingly homogeneous data sources (microarray databases) and extends to heterogeneous data domains at multiple levels. This is also driven by the growing collaboration between a wide spectrum of scientific disciplines and communities such as is required for translational research. We have used a bottom-up approach that facilitated the identification of four provenance levels necessary to report microarray experiment results and shielded our data model from becoming dependent on constantly evolving ontologies. We have, however, discussed how some of the terms and relationships from existing provenance ontologies can be mapped to our model. Some of the issues that had to be addressed in the integration of microarray data sources may also be relevant for the federation of data sources in general. As more ‘omics’ data are generated, the complexity and requirements of discovery-based research increase. As a result, there is a growing demand for effective data provenance and integration at many levels that counts on the active involvement of scientists and informaticians. Our work represents a step in this direction.

ACKNOWLEDGMENT

The authors would like to express their sincere gratitude to the HCLS IG for helping to organize and coordinate with different task forces including the BioRDF task force within which the neuroscience microarray use case was explored. Thanks also to Helen Parkinson, James Malone, Misha Kapushesky, Jonas Almeida and three anonymous reviewers. MSM appreciated the support of Jelle Goeman (LUMC) during this work.

REFERENCES

[1] Stears RL, Martinsky T, Schena M. (2003). Trends in microarray analysis. Nature Medicine. 9:140-145.
[2] Barrett T, Troup DB, et al. (2009). NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res. 37(Database issue):D885-90.
[3] Lukk M, Kapushesky M, et al. (2010). A global map of human gene expression. Nat Biotechnol. 28:322-324.
[4] Brazma A, Hingamp P, et al. (2001). Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet. 29(4):365-71.
[5] Spellman PT, Miller M, et al. (2002). Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol. 3(9):RESEARCH0046.
[6] Rayner TF, Rocca-Serra P, et al. (2006). A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics. 7:489.
[7] Gollub J, Ball CA, et al. (2003). The Stanford Microarray Database: data access and quality assessment tools. Nucleic Acids Res. 31(1):94-6.
[8] Cheung KH, White K, et al. (2002). YMD: a microarray database for large-scale gene expression analysis. Proc AMIA Symp. 2002:140-4.
[9] Manduchi E, Grant GR, et al. (2004). RAD and the RAD Study-Annotator: an approach to collection, organization and exchange of all relevant information for high-throughput gene expression studies. Bioinformatics. 20(4):452-9.
[10] Parkinson H, Sarkans U, et al. (2005). ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. 33(Database issue):D553-5.
[11] Berners-Lee T, Hendler J, Lassila O. (2001). The Semantic Web. Scientific American. 284(5):34-43.
[12] Wang X, Gorlitsky R, Almeida JS. (2005). From XML to RDF: how semantic web technologies will change the design of 'omic' standards. Nat Biotechnol. 23(9):1099-103.
[13] Dunckley T, Beach TG, et al. (2006). Gene expression correlates of neurofibrillary tangles in Alzheimer's disease. Neurobiol Aging. 27:1359-71.
[14] Liang WS, Dunckley T, et al. (2007). Gene expression profiles in anatomically and functionally distinct regions of the normal aged human brain. Physiol Genomics. 28:311-22.
[15] Liang WS, Reiman EM, et al. (2008). Alzheimer's disease is associated with reduced expression of energy metabolism genes in posterior cingulate neurons. Proc Natl Acad Sci U S A. 105:4441-6.
[16] Alexander K, Cyganiak R, Hausenblas M, Zhao J. (2009). Describing linked datasets. In Linked Data on the Web Workshop in the International World Wide Web Conference, Madrid, Spain.
[17] Hartig O, Zhao J. (2010). Publishing and consuming provenance metadata on the web of linked data. In Proceedings of the Third International Provenance and Annotation Workshop, Troy, NY, USA. In press.
[18] Kotecha N, Bruck K, Lu W, Shah N. (2008). Pathway knowledge base: An integrated pathway resource using BioPAX. Applied Ontology. 3(4):235-245.
[19] Missier P, Sahoo S, Zhao J, Goble C, Sheth A. (2010). Janus: Semantic Provenance Infrastructure for Taverna. In Proceedings of the Third International Provenance and Annotation Workshop, Troy, NY, USA. In press.
[20] Altintas I, Anand M, et al. (2010). Understanding Collaborative Studies Through Interoperable Workflow Provenance. In Proceedings of the Third International Provenance and Annotation Workshop, Troy, NY, USA. In press.
[21] McCusker J, McGuinness D. (2010). Explorations into the Provenance of High Throughput Biomedical Experiments. In Proceedings of the Third International Provenance and Annotation Workshop, Troy, NY, USA. In press.
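
As a supplementary sketch of the selection logic in query Q4 (Section III.B), the Python fragment below mirrors its first two steps over in-memory records: the fold-change and p-value filters shared with Q2, and the choice of the most recently issued dataset description. It is an illustration under stated assumptions, not the authors' implementation. The class names, field names, gene labels, endpoint URLs and dates are all hypothetical placeholders and are not part of the biordf vocabulary or of the published gene lists.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class GeneResult:
    """One gene-list row; hypothetical stand-in for a gene with fold change and p-value."""
    gene_label: str
    fold_change: float
    p_value: float


@dataclass(frozen=True)
class DatasetDescription:
    """Dataset-level metadata, loosely modeled on void:sparqlEndpoint and dct:issued."""
    endpoint: str
    issued: date


def overexpressed(rows: Sequence[GeneResult], max_p: float = 0.001) -> List[GeneResult]:
    """Keep rows with fold change > 0 and p-value < max_p, as in the Q2 filters."""
    return [r for r in rows if r.fold_change > 0 and r.p_value < max_p]


def latest_endpoint(datasets: Sequence[DatasetDescription]) -> Optional[str]:
    """Return the endpoint of a dataset for which no later-issued dataset exists.

    This mirrors the OPTIONAL + FILTER(!BOUND(...)) pattern: a candidate is
    kept only if the search for a newer dataset comes back empty.
    """
    for candidate in datasets:
        has_newer = any(other.issued > candidate.issued for other in datasets)
        if not has_newer:
            return candidate.endpoint
    return None


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    rows = [
        GeneResult("GENE_A", 1.8, 0.0004),
        GeneResult("GENE_B", -2.1, 0.0001),
        GeneResult("GENE_C", 0.9, 0.02),
    ]
    datasets = [
        DatasetDescription("http://example.org/sparql/old", date(2009, 5, 1)),
        DatasetDescription("http://example.org/sparql/new", date(2010, 3, 1)),
    ]
    print([r.gene_label for r in overexpressed(rows)])
    print(latest_endpoint(datasets))
```

Expressing "most recent" as "no newer item exists" is what allows the choice to be written in a query language without aggregates. In a general-purpose language, taking the maximum over the issue date would give the same result.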