Capturing Provenance for a Linkset of Convenience

Simon Jupp (1), James Malone (1), and Alasdair J G Gray (2)

(1) European Molecular Biology Laboratory, European Bioinformatics Institute
    (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge, United Kingdom
(2) Department of Computer Science, Heriot-Watt University, Edinburgh, United
    Kingdom


       Abstract. Biological interactions such as those between genes and
       proteins are complex and require intricate OWL models. However, direct
       links between biological entities can support search and data
       integration. In this paper we introduce linksets of convenience that
       capture these direct links. We show the provenance statements required
       to track the derivation of such linksets, linking them back to the full
       biological justification.

       Keywords: Data linking, Provenance, VoID


1     Introduction

Investigating biological systems, such as those implicated in disease, necessitates
the connection of many levels of biology: gene, gene variation, gene expression,
protein structure, signalling pathways, phenotypic, epidemiological data and so
on. The ability to integrate data across these levels relies on links that can be
formed between biological entities, for example, going from a gene to proteins or
proteins to pathways. For each of these links there is some biological justification
that may involve several steps (see Section 2 for details). To support tasks such
as search and data integration it is convenient to provide additional shortcuts in
the form of a direct link, e.g. genes to pathways.
    Modeling the true nature of the links using semantic web technologies such
as OWL removes ambiguity when working with data by giving it well-defined
and precise semantics. However, it increases the complexity of interacting with
the data, as the OWL model needs to capture the full intricacies of the biological
interactions. As we move to publish biological data as linked open data, there
is an opportunity to describe direct links between different types of biological
entities that feature in common queries, such as gene to protein. These shortcuts
capture the way that biologists often discuss the domain and enable novel
integrations of the data. They provide a working notion that cuts through the
biology without having to capture (or recapture) the complex multivariate
relationships that can hold between the two entities. Such linksets are already
used to support the Open

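[Figure 1 diagram. Nodes: Ensembl Gene (so:gene), Ensembl Transcript
(so:transcript), Ensembl Exon (so:exon), Ensembl Protein (so:polypeptide),
UniProt Protein (uniprot:Protein). Solid edges: so:transcribed_from (gene and
transcript), so:has_part (transcript and exon), so:translates_to (transcript
and Ensembl protein), :ep2upRelation (Ensembl protein and UniProt protein).
Dashed edge: skos:related (Ensembl gene and UniProt protein).]
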

Fig. 1. Linking an Ensembl gene with its UniProt protein. Solid lines show the full
semantic modelling required while the dashed line represents the linkset of convenience.


PHACTS Discovery Platform [1], although those linksets do not have adequate
provenance.
    In this paper we propose a mechanism to model these links of convenience
using a combination of VoID linksets [2] and PROV [3]. We avoid misrepresent-
ing links by applying semantically weaker relationships together with additional
provenance which represents the underlying complexity. We illustrate the model
with an example using data from two popular biological databases.


2     Linking Genes to Proteins Use Case

We motivate our work with an example mapping between Ensembl [4] (a database
of genome annotation) and UniProt [5] (a database of protein sequences). These
databases already contain cross-references between an Ensembl Gene (EG) and
a UniProt Protein (UP). However, to understand how this mapping is generated
you currently need to find the relevant publications and online documentation;
the derivation is not directly discoverable from the data.
    Biological theory tells us that a gene encodes a protein, although this
biological relation only truly holds for the link between the EG and the Ensembl
Protein (EP) entity. There are in fact multiple types of UP to EP mappings; for
instance, they can be derived from an exact sequence identity or they might be
based on a percentage sequence identity. Figure 1 illustrates how we model EG
to EP using terminology defined in the Sequence Ontology, and for illustration
we include a superproperty of all the EP to UP mappings that we call
ep2upRelation (see footnote 3). We introduce a link of convenience (dashed line)
from the EG to the UP that supports queries using the semantically weak
skos:related relation. This schema lacks the provenance to assert that the
skos:related link of convenience is derived from the longer chain of semantically
richer links that hold from a gene to protein.
Footnote 3: UniProt are currently extending their vocabulary to define these relations.
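
To make the difference between the full chain and the shortcut concrete, the following is a minimal instance-level sketch of the pattern in Figure 1. It is illustrative only: the ex: identifiers, the base namespace and the so: namespace IRI are hypothetical placeholders, and the direction of each property (a transcript is transcribed from a gene and translates to a polypeptide) is an assumption, since the figure residue does not record arrow directions.

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
# Hypothetical namespaces, used here only for illustration.
@prefix so:   <http://example.org/so#> .
@prefix ex:   <http://example.org/id/> .
@prefix :     <http://example.org/linkset#> .

# Full chain (solid lines in Fig. 1); all ex: identifiers are placeholders.
ex:transcript1 so:transcribed_from ex:gene1 .        # ET is transcribed from EG
ex:transcript1 so:translates_to    ex:ensProtein1 .  # ET translates to EP
ex:ensProtein1 :ep2upRelation      ex:uniProtein1 .  # EP maps to UP

# Link of convenience (dashed line in Fig. 1): one weak, direct triple.
ex:gene1 skos:related ex:uniProtein1 .
```

A query that only needs to move from a gene to its UniProt protein can follow the single skos:related triple, while the three triples above it carry the biological justification.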

1    # define the Ensembl protein partition
2    :ensembl void:classPartition :EPpartition .
3    :EPpartition void:class so:Polypeptide .
4
5    # define the UniProt protein partition
6    :uniprot void:classPartition :UPpartition .
7    :UPpartition void:class uniprot:Protein .
8
9    # define the linkset that links the two partitions
10   :ensemblProteinToUniprotProteinLinkset a void:Linkset ;
11       void:linkPredicate :ep2upRelation .
12
13   # define partitions for Ensembl gene, gene transcript and
14   # transcript protein
15   :ensembl void:classPartition :ensemblGenePartition ;
16       void:propertyPartition :ensemblGeneTranscriptPartition ;
17       void:propertyPartition :ensemblTranscriptProteinPartition .
18   :ensemblGenePartition void:class so:gene .
19   :ensemblGeneTranscriptPartition void:property so:transcribed_from .
20   :ensemblTranscriptProteinPartition void:property so:translates_to .
21
22   # define the linkset that links the two partitions,
23   # including the dataset description that contains the triples that
24   # are used to derive this linkset
25   :ensemblGeneToUniprotProteinLinkset a void:Linkset ;
26       void:linkPredicate skos:related ;
27       void:subjectsTarget :ensemblGenePartition ;
28       void:objectsTarget :UPpartition ;
29       prov:wasDerivedFrom :ensemblGeneTranscriptPartition ,
30           :ensemblTranscriptProteinPartition ,
31           :ensemblProteinToUniprotProteinLinkset .


Fig. 2. Description of the linkset of convenience between Ensembl Gene and UniProt
Protein which includes the provenance derivation.


3    Describing Linksets

The model outlined in Figure 1 can be decorated with provenance that captures
additional information about how the link of convenience between EG and UP is
derived. The resulting linkset description is shown in Figure 2. In the following
we describe the blocks of RDF.
    The Vocabulary of Interlinked Datasets (VoID) [2] allows the description of
RDF links between datasets using VoID linksets. A linkset allows us to describe
the links, captured as a set of triples, between two datasets. We can also use
VoID to describe relevant partitions of the datasets based on individual
properties or classes; these form new subsets that can participate in multiple
linksets. In our scenario we
need to capture two crucial linksets: the first is the EP to UP linkset, and the
second is the more convenient EG to UP linkset.
    The EP-UP linkset captures the :ep2upRelation link between types of EP
in the Ensembl dataset, and types of UP in the UniProt dataset (lines 10-11).
We describe two further subsets: the EP partition of all entities that are of type
so:Polypeptide in the Ensembl dataset (lines 2-3) and the UniProt subset of
all entities that are of type uniprot:Protein (lines 6-7).
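
Lines 10-11 of Figure 2 name only the link predicate of the EP-UP linkset. As a hedged sketch, the linkset could be tied to the two partitions in the same way that the EG-UP linkset is on lines 27-28. The two target statements and the base namespace below are assumptions for illustration and are not part of the original figure.

```turtle
@prefix void: <http://rdfs.org/ns/void#> .
# Hypothetical base namespace for the dataset descriptions.
@prefix :     <http://example.org/linkset#> .

:ensemblProteinToUniprotProteinLinkset a void:Linkset ;
    void:linkPredicate  :ep2upRelation ;
    void:subjectsTarget :EPpartition ;   # assumed: subjects are Ensembl proteins
    void:objectsTarget  :UPpartition .   # assumed: objects are UniProt proteins
```

Stating the targets explicitly would let a consumer of the description see which subsets the :ep2upRelation triples connect without inspecting the triples themselves.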
    The EG to UP link of convenience needs a similar linkset description based
on an EG partition and the previous UP partition, although this time the
relation is skos:related (lines 25-28). We also want to capture that the triples
in this linkset are derived from another set of triples, that is, that
skos:related is a shortcut relation for a more complex path through the RDF
graph. Again we can use VoID partitioning, but this time using property-based
partitions to identify the EG to Ensembl Transcript (ET) and ET to EP links
(lines 15-20). Finally, we use the prov:wasDerivedFrom relation to link the
convenience linkset to the partitions and the linkset that describe the full path
of relations that the shortcut represents (lines 29-31).


4   Discussion

It is always important to model your data as accurately as possible,
and publishing data with RDF and OWL is well suited to this task. The VoID
vocabulary already provides a mechanism to define and attach provenance to
linksets between datasets, and we are proposing the use of PROV to connect
linksets that are derived from other linksets. As a Web of linked biological data
emerges, there is a need to identify links that are there for convenience, and
to expose how they relate back to the core biological (OWL) model. In cases where
a link of convenience is derived from a series of other linksets, it is desirable to
be able to spot this and unpack the convenience links using common queries.
The model proposed supports this task, but questions remain as to whether VoID
and PROV are enough, so we hope this preliminary work can help motivate the
discussion.

Acknowledgements
EBI contribution supported by EU FP7 BioMedBridges Grant 284209.


References
1. Gray, A.J.G., Groth, P., Loizou, A., Askjaer, S., Brenninkmeijer, C.Y.A., Burger,
   K., Chichester, C., Evelo, C.T., Goble, C.A., Harland, L., Pettifer, S., Thompson,
   M., Waagmeester, A., Williams, A.J.: Applying linked data approaches to pharma-
   cology: Architectural decisions and implementation. Semant. Web 5 (2014) 101–113
2. Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing Linked Datasets
   with the VoID Vocabulary. Note, W3C (March 2011)
3. Lebo, T., Sahoo, S.S., McGuinness, D.: PROV-O: The PROV Ontology. Technical
   report, W3C Recommendation (2013) http://www.w3.org/TR/prov-o/.
4. Flicek, P., Amode, M.R., Barrell, D., et al.: Ensembl 2014. Nucleic acids research
   42 (2014) D749–D755 doi: 10.1093/nar/gkt1196.
5. The UniProt Consortium: Activities at the universal protein resource (UniProt).
   Nucleic acids research 42 (2014) D191–D198 doi: 10.1093/nar/gkt1140.