=Paper= {{Paper |id=Vol-369/paper-8 |storemode=property |title=Provenance and Linked Data in Biological Data Webs |pdfUrl=https://ceur-ws.org/Vol-369/paper07.pdf |volume=Vol-369 |dblpUrl=https://dblp.org/rec/conf/www/ZhaoKS08 }} ==Provenance and Linked Data in Biological Data Webs== https://ceur-ws.org/Vol-369/paper07.pdf
        Provenance and Linked Data in Biological Data Webs

                                              Jun Zhao, Graham Klyne and David Shotton
                                                         Image Bioinformatics Research Group
                                                               Department of Zoology
                                                                 University of Oxford
                                                                Oxford OX1 3PS, UK
                                      {jun.zhao,graham.klyne,david.shotton}@zoo.ox.ac.uk


ABSTRACT
To create a linked data web of heterogeneous biological data resources, we need not only to
define and create the alignment between related data resources but also to express the
knowledge about why data items from different sources are linked with each other and how each
data link has evolved, so that scientists can trust the data links provided by the data web.
This paper highlights the importance of keeping provenance information about the links between
data items from different sources, and proposes the use of named graphs to make a provenance
statement about each pair of linked data items and each release of a data web.

Categories and Subject Descriptors
D.2.12 [Software Engineering]: Interoperability—Data mapping; H.3.5 [Information Storage And
Retrieval]: Online Information Services

General Terms
Language, Reliability

Keywords
Data Web, Named Graphs, Provenance, RDF, Semantic Web, Trust

1.     INTRODUCTION
  The number of biology databases available has increased rapidly in recent years [4]. To
obtain knowledge about a gene or protein from this sea of data, biologists often need to go
through an information gathering process, navigating between the public genomic and
publication databases. These resources are scattered around the world and present data in
heterogeneous formats. Scientists have to rely on their domain knowledge in order to identify
how data resources are linked with each other.
  To simplify this process, the Image Bioinformatics Research Group (IBRG)1 of the University
of Oxford proposes the use of subject-specific data webs, which use the Web as the native
platform upon which to integrate access to datasets relating to particular subjects [7]. Within
each data web, data resources are integrated using loosely coupled software tools that permit
both information discovery and links back to the original data. With this approach, the data
linked into the data webs are neither required to be semantically coordinated nor constrained
to conform to a single imposed model. Furthermore, copyright and access control issues remain
the concern of the data sources, not of the data web that unites them. These data sources
maintain their unique characters and continue independent publication of their holdings.
  The first demonstrator data web being developed by IBRG is FlyWeb2, which will integrate the
heterogeneous data resources concerning research on the fruit fly Drosophila melanogaster.
These data resources include FlyTED3 (our local research image repository concerning gene
expression in the testis of fruit flies), BDGP4 (the Berkeley Drosophila Genome Project
database concerning gene expression in Drosophila embryos), FlyBase5 (the global database of
genomics information concerning Drosophila), and online research publications on Drosophila
gene expression. The goal of FlyWeb is to allow biologists to obtain information about a
Drosophila gene, including the gene expression images of its testis and embryos, without
having to hop between the Drosophila data islands on the Web.
  To build FlyWeb, we need not only to define and implement the alignment between Drosophila
data resources, but also to maintain the data links between related data items from different
sources. This position paper focuses on the second issue, and will analyze the motivation for
keeping provenance of the links between related data items and present our proposed solution.

1 http://ibrg.zoo.ox.ac.uk/
2 http://imageweb.zoo.ox.ac.uk/wiki/index.php/FlyWeb_project
3 http://www.fly-ted.org/
4 http://www.fruitfly.org/
5 http://flybase.org/

Permission to make digital or hard copies of all or part of this work for personal or classroom
use is granted without fee provided that copies are not made or distributed for profit or
commercial advantage and that copies bear this notice and the full citation on the first page.
To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
Copyright is held by the author/owner(s). LDOW2008, April 22, 2008, Beijing, China.

2.   SEMANTIC WEB AND FLYWEB
  The initial development goals of the FlyWeb Project include understanding the distributed
Drosophila data resources; creating the alignment between them; and creating a query service
to access the integrated data resources. At the time
of writing, concentrating first on linking the FlyTED and BDGP databases, we have achieved:
     • Describing the Drosophila testis images in FlyTED using an extension to the Fly Anatomy
       Ontology [1], which is also used by BDGP to describe its Drosophila embryo gene
       expression images.

     • Publishing FlyTED, the Drosophila testis gene expression image database, through a
       SPARQL endpoint [9], the same interface used by BDGP for publishing its gene expression
       images and annotations [5].

     • Identifying the relationships between FlyTED and BDGP using the genomic knowledge,
       particularly the gene names, captured in FlyBase.
  This initial work provides the foundation that permits us to align the two data resources
and build a lightweight data web. However, the evolving nature of biological databases has
motivated us to further consider how to manage the links in FlyWeb, once they have been
established.

3.    MOTIVATION
  Data items from different Drosophila data resources are integrated into FlyWeb using
references to the original data. Related data items are linked together in FlyWeb using
biological knowledge from public genomic databases. Biological knowledge is growing rapidly,
and genomic databases are frequently updated. By referencing back to these evolving databases,
FlyWeb can synchronize with advances in biological knowledge. However, with each update of
such an external resource, some of the links between data items recorded locally within FlyWeb
may become obsolete or need to be updated with more links to related data items. We need to
provide additional metadata about these data links in order to maintain consistency between
FlyWeb and the advancing biological knowledge. This will allow scientists to:

     • Trust that the data links established in FlyWeb are valid;

     • Trust that the data referenced in FlyWeb are consistent with the latest release of the
       public databases;
     • Trace back the data links established by FlyWeb using previous releases of the public
       databases, which may previously have been used by the scientists to annotate their own
       local data.

  Thus, for each data link between a pair of data items, we need to record the following
provenance metadata:

     • The evidence of the link;
     • When this link was created, by whom, using which version of which database;

     • When this link was updated or deprecated;

     • Whether there were any previous links between this pair of data items;

     • What previous links between data items became obsolete, and why.

  To express this provenance of data links, we propose to use named graphs.

4.    NAMED GRAPHS
  An RDF graph contains a collection of RDF triples. A named graph is an RDF graph which is
assigned a name in the form of a URI [3]. It provides a way to group RDF statements into
sub-graphs that may be asserted separately, and it also provides names for such graphs. By
grouping and naming RDF statements as a named graph, applications can state access control
rights, copyright, or provenance information about these RDF statements as a whole. Thus,
named graphs provide a mechanism for establishing trust within the Semantic Web. More
generally, this mechanism allows us to make statements about the content of the graph without
asserting that the statements contained in the graph are true.
  In order to provide information about why a pair of related data items are linked together
in FlyWeb, or why/when they become no longer linked, we create a named graph for each pair of
linked data items. In this position paper, we only consider two types of links between data
items, i.e. either they are the same as or different from each other. There may be other types
of data links in FlyWeb, but the provenance model introduced in this paper is not yet designed
for describing all different types of data links.
  FlyWeb will be updated whenever a major release of the linked-in Drosophila databases is
announced. To provide information about each FlyWeb release and the versions of the public
databases upon which each release is based, we will also create a named graph for each release
of FlyWeb.
  In this position paper we use TriG as the notation to define named graphs. “TriG is a
variation of Turtle [2] which extends that notation by using ‘{’ and ‘}’ to group triples into
multiple graphs, and to precede each by the name of that graph” [3].

4.1    A Named Graph for Each FlyWeb Release
  FlyWeb integrates several Drosophila data sources, denoted a, b, c, etc. Each data source is
associated with version information. Thus a_x indicates version x of data source a.
  Each release of FlyWeb (fw_g, fw_k, etc.) will contain a collection of data items, i_m, i_n,
etc., from different Drosophila data sources. A data item from data source a of version x
should be uniquely identified as i_m(a_x). In fw_g, i_m(a_x) will be described by all the
metadata from its original data source, as well as by FlyWeb statements about whether it is
related to another data item i_n(b_x) from data source b_x, or whether i_m(a_x) had previously
been linked with i_n(b_x) in a previous release of FlyWeb.
  Each release of FlyWeb itself is a named graph, which is associated with information about
when it was released, by whom, using which versions of which databases.
  Two such named graphs are shown in Examples 1 and 2 (see Figure 1). Example 1 states that
FlyWeb version 1.0 (:flyweb_r1) was released on “2007-12-19” and was built using FlyTED
version 1.0, the BDGP database version “2007-03-09” and FlyBase version 4.3. It also contains a
single statement about data items, i.e. the gene flyted:gene_g1 from FlyTED is the same as the
gene bdgp:gene_g2 6 from BDGP. Example 2 defines a named graph of FlyWeb version 1.1, which
was built using the same versions of FlyTED and BDGP as FlyWeb version 1.0, but a different
version of FlyBase. Because of this update, gene flyted:gene_g1 is no longer the same as
bdgp:gene_g2.

6 The bdgp namespace might not be the actual namespace used by the BDGP SPARQL endpoint. Due
to technical maintenance, its server was unreachable when the paper was written.

  Example 1. Named graph for FlyWeb release 1.0

@prefix dw:  .
@prefix flyted:  .
@prefix dc:  .
@prefix xsd:  .
@prefix owl:  .
@prefix bdgp:  .
@prefix dwi:  .
@prefix :  .

:flyweb_r1 {
  flyted:gene_g1 owl:sameAs bdgp:gene_g2 .
  :flyweb_r1 dc:created "2007-12-19"^^xsd:date;
              dc:hasVersion "1.0"   ;
              dc:creator
         ;
              dw:derivedFrom flyted:v1.0 ;
              dw:derivedFrom  ;
              dw:derivedFrom  .
}

  Example 2. Named graph for FlyWeb release 1.1

@prefix dw:  .
@prefix flyted:  .
@prefix dc:  .
@prefix xsd:  .
@prefix owl:  .
@prefix bdgp:  .
@prefix dwi:  .
@prefix :  .

:flyweb_r2 {
  :flyweb_r2  dc:created "2008-01-25"^^xsd:date;
              dc:hasVersion "1.1"   ;
              dc:creator
         ;
              dw:derivedFrom flyted:v1.0 ;
              dw:derivedFrom  ;
              dw:derivedFrom  .
}

4.2    A Named Graph for Each Data Link
  In each FlyWeb named graph, a collection of named graphs is also created for the data links
between pairs of related data items. Each such named graph states:
     • Why a pair of data items should be or should no longer be linked;
     • When the link was made or released, and by whom;
     • Which previous links had been created between this pair of data items;
     • What type of link it is: a MappingRelation that is either a SameRelation or a
       DifferentRelation. The last two concepts will be defined in a data web ontology using
       the owl:sameAs and owl:differentFrom properties.
  Example 3 (see Figure 2) shows a named graph :mapping_m1 that defines an abstract
relationship between the gene from FlyTED (flyted:gene_g1) and the gene from BDGP
(bdgp:gene_g2) and traces this relationship by its two children, both of which are themselves
named graphs and define the actual relationships between these two genes built in different
releases of FlyWeb.

Figure 2: The named graph for Example 3.

  The first child :mapping_m11 defines that the two genes are synonyms given the evidence of
:evidence_e1 and that this link was created on “2007-12-19” within the release of :flyweb_r1.
The second child :mapping_m12 states that the two genes are not the same given the evidence of
:evidence_e2, and that this link was created on “2008-01-25” within the release of :flyweb_r2.
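To make the structure of such a link graph more concrete, the following Python sketch models the per-release children of a mapping as plain data and derives the link history from them. It is only an illustration under stated assumptions: the names `LinkAssertion`, `link_history` and `current_relation` are hypothetical and are not part of FlyWeb, and the field values are simply copied from the two children described above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LinkAssertion:
    """One child graph of a mapping (hypothetical model, not FlyWeb code)."""
    name: str        # graph name, e.g. ":mapping_m11"
    relation: str    # "SameRelation" or "DifferentRelation"
    evidence: str    # the evidence the link is based on
    created_in: str  # the FlyWeb release graph the link was created in
    created: date    # creation date of the link


def link_history(children):
    """Return the child assertions ordered from oldest to newest."""
    return sorted(children, key=lambda child: child.created)


def current_relation(children):
    """Return the relation asserted by the most recent child, or None."""
    history = link_history(children)
    return history[-1].relation if history else None


# Values taken from the two children described in the text.
m11 = LinkAssertion(":mapping_m11", "SameRelation", ":evidence_e1",
                    ":flyweb_r1", date(2007, 12, 19))
m12 = LinkAssertion(":mapping_m12", "DifferentRelation", ":evidence_e2",
                    ":flyweb_r2", date(2008, 1, 25))

for child in link_history([m12, m11]):
    print(child.created, child.name, child.relation, child.evidence)
print("current:", current_relation([m12, m11]))
```

Keeping every child rather than overwriting the previous one is what allows a user to see both the current relation and the evidence behind each earlier one.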
  The dw:childOf property links :mapping_m11 and :mapping_m12 with the graph :mapping_m1, and
they are linked together by the property dw:siblingOf. These properties enable us to trace a
lineage of the data links between a pair of data items.

     Example 3. Named graph for a data link

@prefix dw:  .
@prefix flyted:  .
@prefix dc:  .
@prefix xsd:  .
@prefix owl:  .
@prefix bdgp:  .
@prefix dwi:  .
@prefix :  .
@prefix rdf:
    .

:mapping_m1 {
   :mapping_m1    rdf:type  dw:MappingRelation .
   flyted:gene_g1 dw:maps   bdgp:gene_g2 .
   # the first child
   :mapping_m11 dw:childOf     :mapping_m1     ;
                dw:evidencedBy :evidence_e1    ;
                dw:createdIn   :flyweb_r1      ;
                rdf:type       dw:SameRelation ;
                dc:creation
                         "2007-12-19"^^xsd:date .

   :mapping_m11 {
         flyted:gene_g1 owl:sameAs bdgp:gene_g2 .
   }

   # the second child
   :mapping_m12 dw:childOf     :mapping_m1      ;
                dw:evidencedBy :evidence_e2     ;
                dw:siblingOf   :mapping_m11     ;
                dw:createdIn   :flyweb_r2       ;
                rdf:type
                            dw:DifferentRelation ;
                dc:creation
                         "2008-01-25"^^xsd:date .
   :mapping_m12 {
         flyted:gene_g1 owl:differentFrom bdgp:gene_g2 .
   }
}

5.     SCENARIOS
  This section uses the above example datasets to walk through three scenarios to show how the
named graphs could help us to manage the data links in FlyWeb in a manner that promotes trust.

5.1     Links in a Previous Release
  The first scenario shows how FlyWeb can help users to find out which data items in FlyWeb
are linked to their gene, which is annotated using information from FlyBase release 4.3.
  Many biology data compilations are maintained locally by research groups and might not
always be kept up-to-date with successive releases of the genomic database FlyBase due to the
ending of the projects that funded them. Such local legacy data will have been annotated using
information from a now out-of-date version of the public database. Subsequent releases of the
public database might have annotated its gene records using different gene names.
Occasionally, new biological evidence shows that a particular DNA sequence, formerly thought to
be a single gene and given a single gene name, in fact encodes two distinct genes that are then
given different names.
  Without provenance data, users would not be able to find in FlyWeb any data relating to
their locally recorded former gene names, because the genes are now annotated with new names.
In order to prevent this situation in the future, we provide provenance information for each
release of FlyWeb, to state which versions of the public databases it links to. This provides
the flexibility for the scientists to trace data links for their legacy data. A SPARQL query
[6] for this scenario is shown below, which will search for all the data items that are linked
to the gene flyted:gene_g1 in the release of FlyWeb that was built using FlyBase version 4.3.

SELECT *
WHERE { ?g dw:derivedFrom 
          graph ?g {
              { flyted:gene_g1 ?p ?data }
              UNION
              { ?data1 ?p1 flyted:gene_g1 } }
}

5.2   All Links in the Latest Release
  This scenario shows how users can navigate information about a Drosophila gene in the latest
release of FlyWeb using the version information and the creation date associated with the
named graph of each release of FlyWeb. The following SPARQL query will retrieve all the data
links from the v1.1 release of FlyWeb.

SELECT *
WHERE { ?g dc:hasVersion "1.1" .
          graph ?g {?gene1 ?p ?gene2 } }

5.3   Explaining Conflicts
  One way of allowing users to trace the data links between a pair of related data items is to
keep a history of all the data links that have ever existed between them. This means that
conflicting statements about the relationship between the same pair of data items might exist
in different releases of FlyWeb. In order to explain these conflicts, we provide the evidence
information for the data links.
  Example 1 and Example 2, describing releases 1.0 and 1.1 of FlyWeb, contain conflicting
statements about the relationships between flyted:gene_g1 and bdgp:gene_g2. In order to
explain this conflict, we need to take the following steps:
      • Retrieve all the statements about the data link between flyted:gene_g1 and
        bdgp:gene_g2 from different releases of FlyWeb. This will return all the statements
        about the graphs :mapping_m11 and :mapping_m12 that define the relationships between
        the two gene names;
      • Compare the statements about these two graphs in order to find out the differences
        between the two versions of relationships between flyted:gene_g1 and bdgp:gene_g2;
      • Present the differences resulting from the above comparison step to the users,
        including their creation date, in which release of FlyWeb they were created, as well as
        the evidence for explaining why each different relationship existed between
        flyted:gene_g1 and bdgp:gene_g2.
  A SPARQL query for the first step would be:

CONSTRUCT {?cg ?p ?o}
WHERE {
   graph ?g {flyted:gene_g1 ?p1 bdgp:gene_g2 .
             ?g rdf:type dw:MappingRelation .
             ?cg dw:childOf ?g .
             ?cg    ?p     ?o}
}

6.      CONCLUSIONS
  In this position paper we have analyzed how recording the provenance of data links can help
us both maintain the links between related data items and bring trust to the data web, by
providing evidence for links, or tracing how the data links have been updated and maintained.
We have shown the potential of named graphs for expressing this provenance information. The
flexibility of RDF named graphs and the RDF query language SPARQL provide the capability for
us to query and filter the data links on behalf of the data web users, e.g. by presenting only
those links newly created since the previous release of FlyWeb, or those present in a
particular earlier release of FlyWeb.
  When defining this conceptual provenance model, we have adopted existing vocabulary as much
as possible, such as the properties dc:creation and dc:creator from the Dublin Core Metadata
Element Set7. We have also used the dw namespace (http://www.datawebs.net/) to specify the
following properties of our own:
    • dw:derivedFrom
    • dw:evidencedBy
    • dw:childOf
    • dw:siblingOf
    • dw:createdIn
    • dw:maps
We are planning to include these conceptual properties in a data web provenance ontology that
will include other existing vocabularies about provenance and trust.
  In this conceptual model we associated with each data link a dw:evidencedBy property to
provide the information about why particular statements were asserted. This will bring trust
to the linked data for the scientists, so that they can verify that the links are consistent
with scientific knowledge. However, we are still investigating how much information should be
provided as evidence for each data link: whether it should contain the actual heuristic used
for building the links or a textual description of this heuristic; and how we can make this
evidence information more comprehensible for biological researchers.
  There is a separate provenance issue that is not discussed in this position paper, namely
the provenance of the data items themselves. We discussed neither the provenance information
for telling where each data item came from nor the provenance information that might be
associated with a data item from the individual data resource. These are key research topics
for Semantic Web and provenance for life sciences [3, 8]. The datasets published by BDGP
through their SPARQL endpoint have been annotated with some provenance and evidence
information [5]. Those data provenance [...] other descriptions concerning the data. We need
to research how this provenance of data can best be incorporated into FlyWeb, together with
the provenance of the data links.

7 http://www.dublincore.org/documents/dces/

7.   ACKNOWLEDGEMENT
  This work is supported by funding from the JISC (FlyWeb Project to Dr David Shotton;
http://imageweb.zoo.ox.ac.uk/wiki/index.php/FlyWeb_project) and from BBSRC (Grant
BB/E018068/1, The FlyData Project: Decision Support and Semantic Organisation of Laboratory
Data in Drosophila Gene Expression Experiments, to Drs David Shotton and Helen White-Cooper).
The FlyTED Database was developed with funding from the UK’s BBSRC (Grant BB/C503903/1, Gene
Expression in the Drosophila testis, to Drs Helen White-Cooper and David Shotton). Preliminary
data web requirements analysis was supported by a JISC grant to Dr David Shotton (Defining
Image Access Project; http://imageweb.zoo.ox.ac.uk/wiki/index.php/DefiningImageAccess).

8.   REFERENCES
[1] M. Ashburner et al. A structured controlled vocabulary of the anatomy of Drosophila
    melanogaster. http://obofoundry.org/cgi-bin/detail.cgi?id=fly_anatomy.
[2] D. Beckett. Turtle - Terse RDF Triple Language, 2007.
    http://www.dajobe.org/2004/01/turtle/.
[3] J. Carroll, C. Bizer, P. Hayes, and P. Stickler. Named graphs, provenance and trust. In
    Proc. of the 14th International World Wide Web Conference, pages 613–622, Chiba, Japan,
    2005. http://doi.acm.org/10.1145/1060745.1060835.
[4] M. Y. Galperin. The molecular biology database collection: 2008 update. Nucleic Acids
    Research, 36(Database issue):2–4, 2008. doi:10.1093/nar/gkm1037.
[5] C. Mungall. A SPARQL endpoint for a database of annotated gene expression.
    http://www.bioontology.org/wiki/index.php/OBD:SPARQL-InSitu.
[6] E. Prud’hommeaux and A. Seaborne. SPARQL query language for RDF, January 2008. W3C
    Recommendation. http://www.w3.org/TR/rdf-sparql-query/.
[7] D. Shotton. World Wide Science: Promises, Threats and Realities, chapter Data webs for
    image repositories. Oxford University Press, 2008. In press.
[8] J. Zhao, C. Goble, R. Stevens, and D. Turi. Mining Taverna’s Semantic Web of Provenance.
    Journal of Concurrency and Computation: Practice and Experience, 2007.
    doi:10.1002/cpe.1231.
[9] J. Zhao, G. Klyne, and D. Shotton. Building a semantic web image repository for biological
    research images. In Proc. of the 5th European Semantic Web Conference, Tenerife, Spain,
    2008. Accepted.
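Supplementary sketch for the comparison step of Section 5.3. The paper gives a SPARQL query only for the first step (retrieving the statements about the two child graphs); the second step, comparing those statements, might look roughly like the Python below. This is an assumption-laden illustration, not the FlyWeb implementation: the function `diff_statements` is hypothetical, and each child graph is reduced to a simple property-to-value dictionary whose entries are copied from Example 3.

```python
def diff_statements(first, second):
    """Return {property: (value_in_first, value_in_second)} for every
    property whose value differs or that only one description carries."""
    differences = {}
    for prop in sorted(set(first) | set(second)):
        old, new = first.get(prop), second.get(prop)
        if old != new:
            differences[prop] = (old, new)
    return differences


# Statements about the two child graphs, as listed in Example 3.
mapping_m11 = {
    "dw:childOf": ":mapping_m1",
    "dw:evidencedBy": ":evidence_e1",
    "dw:createdIn": ":flyweb_r1",
    "rdf:type": "dw:SameRelation",
    "dc:creation": "2007-12-19",
}
mapping_m12 = {
    "dw:childOf": ":mapping_m1",
    "dw:evidencedBy": ":evidence_e2",
    "dw:siblingOf": ":mapping_m11",
    "dw:createdIn": ":flyweb_r2",
    "rdf:type": "dw:DifferentRelation",
    "dc:creation": "2008-01-25",
}

# Third step: present the differences to the user.
for prop, (old, new) in diff_statements(mapping_m11, mapping_m12).items():
    print(f"{prop}: {old} -> {new}")
```

Properties shared by both children, such as the common parent, drop out of the result, so the user sees only the changed relation type, evidence, release and date.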