Provenance and Linked Data in Biological Data Webs

Jun Zhao, Graham Klyne and David Shotton
Image Bioinformatics Research Group
Department of Zoology, University of Oxford
Oxford OX1 3PS, UK
{jun.zhao,graham.klyne,david.shotton}@zoo.ox.ac.uk

ABSTRACT
To create a linked data web of heterogeneous biological data resources, we need not only to define and create the alignment between related data resources but also to express the knowledge about why data items from different sources are linked with each other and how each data link has evolved, so that scientists can trust the data links provided by the data web. This paper highlights the importance of keeping provenance information about the links between data items from different sources, and proposes the use of named graphs to make a provenance statement about each pair of linked data items and each release of a data web.

Categories and Subject Descriptors
D.2.12 [Software Engineering]: Interoperability - Data mapping; H.3.5 [Information Storage And Retrieval]: Online Information Services

General Terms
Language, Reliability

Keywords
Data Web, Named Graphs, Provenance, RDF, Semantic Web, Trust

1. INTRODUCTION
The number of biology databases available has increased rapidly in recent years [4]. To obtain knowledge about a gene or protein from this sea of data, biologists often need to go through an information gathering process, navigating between the public genomic and publication databases. These resources are scattered around the world and present data in heterogeneous formats. Scientists have to rely on their domain knowledge in order to identify how data resources are linked with each other.

To simplify this process, the Image Bioinformatics Research Group (IBRG)¹ of the University of Oxford proposes the use of subject-specific data webs, which use the Web as the native platform upon which to integrate access to datasets relating to particular subjects [7]. Within each data web, data resources are integrated using loosely coupled software tools that permit both information discovery and links back to the original data. With this approach, the data linked into the data webs are neither required to be semantically coordinated nor constrained to conform to a single imposed model. Furthermore, copyright and access control issues remain the concern of the data sources, not of the data web that unites them. These data sources maintain their unique characters and continue independent publication of their holdings.

The first demonstrator data web being developed by IBRG is FlyWeb², which will integrate the heterogeneous data resources concerning research on the fruit fly Drosophila melanogaster. These data resources include FlyTED³ (our local research image repository concerning gene expression in the testis of fruit flies), BDGP⁴ (the Berkeley Drosophila Genome Project database concerning gene expression in Drosophila embryos), FlyBase⁵ (the global database of genomics information concerning Drosophila), and online research publications on Drosophila gene expression. The goal of FlyWeb is to allow biologists to obtain information about a Drosophila gene, including the gene expression images of its testis and embryos, without having to hop between the Drosophila data islands on the Web.

To build FlyWeb, we need not only to define and implement the alignment between Drosophila data resources, but also to maintain the data links between related data items from different sources. This position paper focuses on the second issue: it analyzes the motivation for keeping provenance of the links between related data items and presents our proposed solution.

2. SEMANTIC WEB AND FLYWEB
The initial development goals of the FlyWeb Project include understanding the distributed Drosophila data resources; creating the alignment between them; and creating a query service to access the integrated data resources. At the time of writing, concentrating first on linking the FlyTED and BDGP databases, we have achieved:

• Describing the Drosophila testis images in FlyTED using an extension to the Fly Anatomy Ontology [1], which is also used by BDGP to describe its Drosophila embryo gene expression images.

• Publishing FlyTED, the Drosophila testis gene expression image database, through a SPARQL endpoint [9], the same interface used by BDGP for publishing its gene expression images and annotations [5].

• Identifying the relationships between FlyTED and BDGP using the genomic knowledge, particularly the gene names, captured in FlyBase.

This initial work provides the foundation that permits us to align the two data resources and build a lightweight data web. However, the evolving nature of biological databases has motivated us to further consider how to manage the links in FlyWeb once they have been established.

3. MOTIVATION
Data items from different Drosophila data resources are integrated into FlyWeb using references to the original data. Related data items are linked together in FlyWeb using biological knowledge from public genomic databases. Biological knowledge is growing rapidly, and genomic databases are frequently updated. By referencing back to these evolving databases, FlyWeb can synchronize with advances in biological knowledge. However, with each update of such an external resource, some of the links between data items recorded locally within FlyWeb may become obsolete or need to be updated with more links to related data items. We need to provide additional metadata about these data links in order to maintain consistency between FlyWeb and the advancing biological knowledge. This will allow scientists to:

• Trust that the data links established in FlyWeb are valid;

• Trust that the data referenced in FlyWeb are consistent with the latest release of the public databases;

• Trace back the data links established by FlyWeb using previous releases of the public databases, which may previously have been used by the scientists to annotate their own local data.

Thus, for each data link between a pair of data items, we need to record the following provenance metadata:

• The evidence of the link;

• When this link was created, by whom, using which version of which database;

• When this link was updated or deprecated;

• Whether there were any previous links between this pair of data items;

• What previous links between data items became obsolete, and why.

To express this provenance of data links, we propose to use named graphs.

4. NAMED GRAPHS
An RDF graph contains a collection of RDF triples. A named graph is an RDF graph which is assigned a name in the form of a URI [3]. It provides a way to group RDF statements into sub-graphs that may be asserted separately, and it also provides names for such graphs. By grouping and naming RDF statements as a named graph, applications can state access control rights, copyright, or provenance information about these RDF statements as a whole. Thus, named graphs provide a mechanism for establishing trust within the Semantic Web. More generally, this mechanism allows us to make statements about the content of the graph without asserting that the statements contained in the graph are true.

In order to provide information about why a pair of related data items are linked together in FlyWeb, or why and when they become no longer linked, we create a named graph for each pair of linked data items. In this position paper, we only consider two types of links between data items: the items are either the same as or different from each other. There may be other types of data links in FlyWeb, but the provenance model introduced in this paper is not yet designed to describe all of them.

FlyWeb will be updated whenever a major release of the linked-in Drosophila databases is announced. To provide information about each FlyWeb release and the versions of the public databases upon which each release is based, we will also create a named graph for each release of FlyWeb.

In this position paper we use TriG as the notation to define named graphs. “TriG is a variation of Turtle [2] which extends that notation by using ‘{’ and ‘}’ to group triples into multiple graphs, and to precede each by the name of that graph” [3].

4.1 A Named Graph for Each FlyWeb Release
FlyWeb integrates several Drosophila data sources, denoted a, b, c, etc. Each data source is associated with version information. Thus a_x indicates version x of data source a.

Each release of FlyWeb (fw_g, fw_k, etc.) will contain a collection of data items, i_m, i_n, etc., from different Drosophila data sources. A data item from data source a of version x should be uniquely identified as i_m(a_x). In fw_g, i_m(a_x) will be described by all the metadata from its original data source, as well as by FlyWeb statements about whether it is related to another data item i_n(b_x) from data source b_x, or whether i_m(a_x) had been linked with i_n(b_x) in a previous release of FlyWeb.

Each release of FlyWeb is itself a named graph, which is associated with information about when it was released, by whom, and using which versions of which databases. The following two examples show two such graphs (see Figure 1). Example 1 states that FlyWeb version 1.0 (:flyweb_r1) was released on “2007-12-19” and was built using FlyTED version 1.0, BDGP database version “2007-03-09” and FlyBase version 4.3. It also contains a single statement about data items, namely that the gene flyted:gene_g1 from FlyTED is the same as the gene bdgp:gene_g2⁶ from BDGP. Example 2 defines a named graph of FlyWeb version 1.1, which was built using the same versions of FlyTED and BDGP as FlyWeb version 1.0, but a different version of FlyBase. Because of this update, gene flyted:gene_g1 is no longer the same as bdgp:gene_g2.

Example 1. Named graph for FlyWeb release 1.0

    @prefix dw:     <...> .
    @prefix flyted: <...> .
    @prefix dc:     <...> .
    @prefix xsd:    <...> .
    @prefix owl:    <...> .
    @prefix bdgp:   <...> .
    @prefix dwi:    <...> .
    @prefix :       <...> .

    :flyweb_r1 {
        # the data link asserted in this release
        flyted:gene_g1 owl:sameAs bdgp:gene_g2 .

        # provenance of the release itself
        :flyweb_r1 dc:created "2007-12-19"^^xsd:date ;
            dc:hasVersion "1.0" ;
            dc:creator <...> ;
            dw:derivedFrom flyted:v1.0 ;
            dw:derivedFrom <...> ;
            dw:derivedFrom <...> .
    }

Example 2. Named graph for FlyWeb release 1.1

    @prefix dw:     <...> .
    @prefix flyted: <...> .
    @prefix dc:     <...> .
    @prefix xsd:    <...> .
    @prefix owl:    <...> .
    @prefix bdgp:   <...> .
    @prefix dwi:    <...> .
    @prefix :       <...> .

    :flyweb_r2 {
        # the data link asserted in this release
        flyted:gene_g1 owl:differentFrom bdgp:gene_g2 .

        # provenance of the release itself
        :flyweb_r2 dc:created "2008-01-25"^^xsd:date ;
            dc:hasVersion "1.1" ;
            dc:creator <...> ;
            dw:derivedFrom flyted:v1.0 ;
            dw:derivedFrom <...> ;
            dw:derivedFrom <...> .
    }

4.2 A Named Graph for Each Data Link
In each FlyWeb named graph, a collection of named graphs is also created for the data links between pairs of related data items. Each such named graph states:

• Why a pair of data items should be or should no longer be linked;

• When the link was made or released, and by whom;

• Which previous links had been created between this pair of data items;

• What type the link is: a MappingRelation, either a SameRelation or a DifferentRelation. The last two concepts will be defined in a data web ontology using the owl:sameAs and owl:differentFrom properties.

Example 3 (see Figure 2) shows a named graph that defines an abstract relationship between the gene flyted:gene_g1 from FlyTED and the gene bdgp:gene_g2 from BDGP, and traces this relationship through its two children, both of which are themselves named graphs and define the actual relationships between these two genes built in different releases of FlyWeb.

Figure 2: The named graph for Example 3.

The first child, :mapping_m11, states that the two genes are synonyms given the evidence :evidence_e1, and that this link was created on “2007-12-19” within the release :flyweb_r1. The second child, :mapping_m12, states that the two genes are not the same given the evidence :evidence_e2, and that this link was created on “2008-01-25” within the release :flyweb_r2. The dw:childOf property links :mapping_m11 and :mapping_m12 with the graph :mapping_m1, and the two children are linked to each other by the property dw:siblingOf. These properties enable us to trace the lineage of the data links between a pair of data items.

Example 3. Named graph for a data link

    @prefix dw:     <...> .
    @prefix flyted: <...> .
    @prefix dc:     <...> .
    @prefix xsd:    <...> .
    @prefix owl:    <...> .
    @prefix bdgp:   <...> .
    @prefix dwi:    <...> .
    @prefix :       <...> .
    @prefix rdf:    <...> .

    :mapping_m1 {
        :mapping_m1 rdf:type dw:MappingRelation .
        flyted:gene_g1 dw:maps bdgp:gene_g2 .

        # the first child
        :mapping_m11 dw:childOf :mapping_m1 ;
            dw:evidencedBy :evidence_e1 ;
            dw:createdIn :flyweb_r1 ;
            rdf:type dw:SameRelation ;
            dc:created "2007-12-19"^^xsd:date .

        # the second child
        :mapping_m12 dw:childOf :mapping_m1 ;
            dw:evidencedBy :evidence_e2 ;
            dw:siblingOf :mapping_m11 ;
            dw:createdIn :flyweb_r2 ;
            rdf:type dw:DifferentRelation ;
            dc:created "2008-01-25"^^xsd:date .
    }

    # content of the first child graph
    :mapping_m11 {
        flyted:gene_g1 owl:sameAs bdgp:gene_g2 .
    }

    # content of the second child graph
    :mapping_m12 {
        flyted:gene_g1 owl:differentFrom bdgp:gene_g2 .
    }

5. SCENARIOS
This section uses the above example datasets to walk through three scenarios that show how named graphs could help us to manage the data links in FlyWeb in a manner that promotes trust. The queries use the same prefixes as Examples 1 to 3.

5.1 Links in a Previous Release
The first scenario shows how FlyWeb can help users to find out which data items in FlyWeb are linked to their gene flyted:gene_g1, which is annotated using information from FlyBase release 4.3.

Many biology data compilations are maintained locally by research groups and might not always be kept up-to-date with successive releases of the genomic database FlyBase, because the projects that funded them have ended. Such local legacy data will have been annotated using information from a now out-of-date version of the public database. Subsequent releases of the public database might have annotated its gene records using different gene names. Occasionally, new biological evidence shows that a particular DNA sequence, formerly thought to be a single gene and given a single gene name, in fact encodes two distinct genes that are then given different names.

Without provenance data, users would not be able to find in FlyWeb any data relating to their locally recorded former gene names, because the genes are now annotated with new names. In order to prevent this situation in the future, we provide provenance information for each release of FlyWeb, stating which versions of the public databases it links to. This gives scientists the flexibility to trace data links for their legacy data. A SPARQL query [6] for this scenario is shown below. It searches for all the data items that are linked to the gene flyted:gene_g1 in the release of FlyWeb that was built using FlyBase version 4.3.

    SELECT *
    WHERE {
      GRAPH ?g {
        # the release graph built from FlyBase version 4.3
        ?g dw:derivedFrom <...> .
        { flyted:gene_g1 ?p ?data }
        UNION
        { ?data1 ?p1 flyted:gene_g1 }
      }
    }

5.2 All Links in the Latest Release
This scenario shows how users can navigate information about a Drosophila gene in the latest release of FlyWeb using the version information and the creation date associated with the named graph of each release of FlyWeb. The following SPARQL query will retrieve all the data links from release 1.1 of FlyWeb.

    SELECT *
    WHERE {
      GRAPH ?g {
        ?g dc:hasVersion "1.1" .
        ?gene1 ?p ?gene2
      }
    }

5.3 Explaining Conflicts
One way of allowing users to trace the data links between a pair of related data items is to keep a history of all the data links that have ever existed between them. This means that conflicting statements about the relationship between the same pair of data items might exist in different releases of FlyWeb. In order to explain these conflicts, we provide the evidence information for the data links.

Example 1 and Example 2, describing releases 1.0 and 1.1 of FlyWeb, contain conflicting statements about the relationship between flyted:gene_g1 and bdgp:gene_g2. In order to explain this conflict, we need to take the following steps:

• Retrieve all the statements about the data link between flyted:gene_g1 and bdgp:gene_g2 from different releases of FlyWeb. This will return all the statements about the graphs :mapping_m11 and :mapping_m12 that define the relationships between the two gene names;

• Compare the statements about these two graphs in order to find out the differences between the two versions of the relationship between flyted:gene_g1 and bdgp:gene_g2;

• Present the differences resulting from the above comparison step to the users, including their creation date, the release of FlyWeb in which they were created, and the evidence explaining why each different relationship existed between flyted:gene_g1 and bdgp:gene_g2.

A SPARQL query for the first step would be:

    CONSTRUCT { ?cg ?p ?o }
    WHERE {
      GRAPH ?g {
        flyted:gene_g1 ?p1 bdgp:gene_g2 .
        ?g rdf:type dw:MappingRelation .
        ?cg dw:childOf ?g .
        ?cg ?p ?o
      }
    }

6. CONCLUSIONS
In this position paper we have analyzed how recording the provenance of data links can help us both to maintain the links between related data items and to bring trust to the data web, by providing evidence for links and by tracing how the data links have been updated and maintained. We have shown the potential of named graphs for expressing this provenance information. The flexibility of RDF named graphs and the RDF query language SPARQL allows us to query and filter the data links on behalf of the data web users, e.g. by presenting only those links newly created since the previous release of FlyWeb, or those present in a particular earlier release of FlyWeb.

When defining this conceptual provenance model, we have adopted existing vocabulary as much as possible, such as the dc:created and dc:creator properties from Dublin Core⁷. We have also used the dw namespace (http://www.datawebs.net/) to specify the following properties of our own:

• dw:derivedFrom
• dw:evidencedBy
• dw:childOf
• dw:siblingOf
• dw:createdIn
• dw:maps

We are planning to include these conceptual properties in a data web provenance ontology that will also draw on other existing vocabularies about provenance and trust.

In this conceptual model we associated a dw:evidencedBy property with each data link to provide the information about why particular statements were asserted. This will bring trust to the linked data for the scientists, so that they can verify that the links are consistent with scientific knowledge. However, we are still investigating how much information should be provided as evidence for each data link: whether it should contain the actual heuristic used for building the links or a textual description of this heuristic; and how we can make this evidence information more comprehensible for biological researchers.

There is a separate provenance issue that is not discussed in this position paper, namely the provenance of the data items themselves. We discussed neither the provenance information that tells where each data item came from nor the provenance information that might be associated with a data item in the individual data resource. These are key research topics for the Semantic Web and for provenance in the life sciences [3, 8]. The datasets published by BDGP through their SPARQL endpoint have been annotated with some provenance and evidence information [5]. Those data provenance [...] other descriptions concerning the data. We need to research how this provenance of data can best be incorporated into FlyWeb, together with the provenance of the data links.

7. ACKNOWLEDGEMENT
This work is supported by funding from the JISC (FlyWeb Project to Dr David Shotton; http://imageweb.zoo.ox.ac.uk/wiki/index.php/FlyWeb_project) and from BBSRC (Grant BB/E018068/1, The FlyData Project: Decision Support and Semantic Organisation of Laboratory Data in Drosophila Gene Expression Experiments, to Drs David Shotton and Helen White-Cooper). The FlyTED Database was developed with funding from the UK’s BBSRC (Grant BB/C503903/1, Gene Expression in the Drosophila testis, to Drs Helen White-Cooper and David Shotton). Preliminary data web requirements analysis was supported by a JISC grant to Dr David Shotton (Defining Image Access Project; http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess).

8. REFERENCES
[1] M. Ashburner et al. A structured controlled vocabulary of the anatomy of Drosophila melanogaster. http://obofoundry.org/cgi-bin/detail.cgi?id=fly_anatomy.
[2] D. Beckett. Turtle - Terse RDF Triple Language, 2007. http://www.dajobe.org/2004/01/turtle/.
[3] J. Carroll, C. Bizer, P. Hayes, and P. Stickler. Named graphs, provenance and trust. In Proc. of the 14th International World Wide Web Conference, pages 613–622, Chiba, Japan, 2005. http://doi.acm.org/10.1145/1060745.1060835.
[4] M. Y. Galperin. The molecular biology database collection: 2008 update. Nucleic Acids Research, 36(Database issue):2–4, 2008. doi:10.1093/nar/gkm1037.
[5] C. Mungall. A SPARQL endpoint for a database of annotated gene expression. http://www.bioontology.org/wiki/index.php/OBD:SPARQL-InSitu.
[6] E. Prud’hommeaux and A. Seaborne. SPARQL query language for RDF, January 2008. W3C Recommendation. http://www.w3.org/TR/rdf-sparql-query/.
[7] D. Shotton. World Wide Science: Promises, Threats and Realities, chapter Data webs for image repositories. Oxford University Press, 2008. In press.
[8] J. Zhao, C. Goble, R. Stevens, and D. Turi. Mining Taverna’s Semantic Web of Provenance. Journal of Concurrency and Computation: Practice and Experience, 2007. doi:10.1002/cpe.1231.
[9] J. Zhao, G. Klyne, and D. Shotton. Building a semantic web image repository for biological research images. In Proc. of the 5th European Semantic Web Conference, Tenerife, Spain, 2008. Accepted.

¹ http://ibrg.zoo.ox.ac.uk/
² http://imageweb.zoo.ox.ac.uk/wiki/index.php/FlyWeb_project
³ http://www.fly-ted.org/
⁴ http://www.fruitfly.org/
⁵ http://flybase.org/
⁶ The bdgp namespace might not be the actual namespace used by the BDGP SPARQL endpoint. Due to technical maintenance, its server was unreachable when the paper was written.
⁷ http://www.dublincore.org/documents/dces/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright is held by the author/owner(s). LDOW2008, April 22, 2008, Beijing, China.
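As an informal illustration of the release graphs, link graphs and scenarios described in Sections 4 and 5, the sketch below models named graphs as plain (graph, subject, predicate, object) tuples in Python and answers the questions of Sections 5.1 and 5.3 by pattern matching. It is only a sketch under stated assumptions, not the FlyWeb implementation: the helper names (match, releases_built_from, links_for_item, explain_conflict) are hypothetical, and the version identifiers flybase:v4.3 and flybase:vNEXT are invented stand-ins for IRIs that the paper does not give.

```python
# Each quad is (graph, subject, predicate, object), written as prefixed names.
# "flybase:v4.3" and "flybase:vNEXT" are hypothetical placeholders.
QUADS = [
    # Release graphs (Section 4.1): metadata lives inside the graph it describes.
    (":flyweb_r1", "flyted:gene_g1", "owl:sameAs", "bdgp:gene_g2"),
    (":flyweb_r1", ":flyweb_r1", "dc:hasVersion", "1.0"),
    (":flyweb_r1", ":flyweb_r1", "dw:derivedFrom", "flybase:v4.3"),
    (":flyweb_r2", "flyted:gene_g1", "owl:differentFrom", "bdgp:gene_g2"),
    (":flyweb_r2", ":flyweb_r2", "dc:hasVersion", "1.1"),
    (":flyweb_r2", ":flyweb_r2", "dw:derivedFrom", "flybase:vNEXT"),
    # Link-provenance graph (Section 4.2).
    (":mapping_m1", ":mapping_m1", "rdf:type", "dw:MappingRelation"),
    (":mapping_m1", "flyted:gene_g1", "dw:maps", "bdgp:gene_g2"),
    (":mapping_m1", ":mapping_m11", "dw:childOf", ":mapping_m1"),
    (":mapping_m1", ":mapping_m11", "rdf:type", "dw:SameRelation"),
    (":mapping_m1", ":mapping_m11", "dw:evidencedBy", ":evidence_e1"),
    (":mapping_m1", ":mapping_m11", "dw:createdIn", ":flyweb_r1"),
    (":mapping_m1", ":mapping_m12", "dw:childOf", ":mapping_m1"),
    (":mapping_m1", ":mapping_m12", "rdf:type", "dw:DifferentRelation"),
    (":mapping_m1", ":mapping_m12", "dw:evidencedBy", ":evidence_e2"),
    (":mapping_m1", ":mapping_m12", "dw:createdIn", ":flyweb_r2"),
]


def match(quads, g=None, s=None, p=None, o=None):
    """Return the quads matching a pattern; None acts as a wildcard."""
    pattern = (g, s, p, o)
    return [q for q in quads
            if all(want is None or want == got for want, got in zip(pattern, q))]


def releases_built_from(quads, source_version):
    """Release graphs that state, about themselves, dw:derivedFrom source_version."""
    hits = match(quads, p="dw:derivedFrom", o=source_version)
    return sorted({g for g, s, _, _ in hits if g == s})


def links_for_item(quads, graph, item):
    """Triples in `graph` with `item` as subject or object (the UNION of Section 5.1)."""
    return [(s, p, o) for _, s, p, o in match(quads, g=graph) if item in (s, o)]


def explain_conflict(quads, a, b):
    """Provenance of every child link under a MappingRelation graph relating a and b."""
    history = {}
    for g, _, _, _ in match(quads, s=a, p="dw:maps", o=b):
        if not match(quads, g=g, s=g, p="rdf:type", o="dw:MappingRelation"):
            continue  # not a link-provenance graph
        for _, child, _, _ in match(quads, g=g, p="dw:childOf", o=g):
            history[child] = {p: o for _, _, p, o in match(quads, g=g, s=child)}
    return history


# Section 5.1: links to the gene in the release built from the assumed FlyBase 4.3 id.
for release in releases_built_from(QUADS, "flybase:v4.3"):
    print(release, links_for_item(QUADS, release, "flyted:gene_g1"))

# Section 5.3: type, evidence and release of each link, for side-by-side comparison.
for child, facts in sorted(explain_conflict(QUADS, "flyted:gene_g1", "bdgp:gene_g2").items()):
    print(child, facts["rdf:type"], facts["dw:evidencedBy"], facts["dw:createdIn"])
```

Keeping each release's metadata inside its own graph, as in Examples 1 and 2, is what lets a single pattern select the right release; a real deployment would presumably issue the SPARQL queries of Section 5 against a quad store rather than scan a Python list.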