=Paper=
{{Paper
|id=Vol-1114/Session4_Brenninkmeijer
|storemode=property
|title=Computing Identity Co-Reference Across Drug Discovery Datasets
|pdfUrl=https://ceur-ws.org/Vol-1114/Session4_Brenninkmeijer.pdf
|volume=Vol-1114
|dblpUrl=https://dblp.org/rec/conf/swat4ls/BrenninkmeijerDGGPS13
}}
==Computing Identity Co-Reference Across Drug Discovery Datasets==
Computing Identity Co-Reference Across Drug Discovery Datasets Christian Y A Brenninkmeijer1 , Ian Dunlop1 , Carole Goble1 , Alasdair J G Gray2 , Steve Pettifer1 , and Robert Stevens1 1 School of Computer Science, University of Manchester, UK. 2 Department of Computer Science, Heriot-Watt University, UK. Abstract. This paper presents the rules used within the Open PHACTS (http://www.openphacts.org) Identity Management Service to com- pute co-reference chains across multiple datasets. The web of (linked) data has encouraged a proliferation of identifiers for the concepts cap- tured in datasets; with each dataset using their own identifier. A key data integration challenge is linking the co-referent identifiers, i.e. identifying and linking the equivalent concept in every dataset. Exacerbating this challenge, the datasets model the data differently, so when is one repre- sentation truly the same as another? Finally, different users have their own task and domain specific notions of equivalence that are driven by their operational knowledge. Consumers of the data need to be able to choose the notion of operational equivalence to be applied for the con- text of their application. We highlight the challenges of automatically computing co-reference and the need for capturing the context of the equivalence. This context is then used to control the co-reference com- putation. Ultimately, the context will enable data consumers to decide which co-references to include in their applications. 1 Introduction Within the life sciences there has been a proliferation of databases published, with the 2013 NAR database issue listing over 1,500 [8]. An increasing number of these are being published as linked data – either directly, e.g. the recent publication of RDF data by the EBI3 [17] or through projects such as Bio2RDF [5] – forming a web of linked data. A key data integration challenge is identifying the “same” concept across these datasets. While there have been attempts to provide global identifiers for a concept, e.g. with life sciences identifiers [7], these have not gained widespread use. Conse- quently, there is no global, or life sciences, identifier scheme used by all datasets to identify a given concept; each dataset uses their own identifier scheme, leading to a proliferation of identifiers for (notionally) the same concept [12]. Then there is the problem of what does it mean to be the same; just repre- senting the truth is not a tenable position. As demonstrated by Halpin et al [15] 3 http://www.ebi.ac.uk/rdf/ accessed November 2013 2 C.Y.A. Brenninkmeijer et al. there are many interpretations of the owl:sameAs relationships that exist in the data. Life sciences datasets contain complimentary and overlapping data, mod- elled with different levels of granularity depending upon the purpose of the data capture. Therefore, when we refer to the data being about the “same” concept, we are not looking for identical representations but rather saying that these two complimentary records can be treated as being operationally equivalent. How- ever, the notion of operational equivalence depends upon the use to which the data will be put and thus can only be decided by the user or the application they use. Linked data allows a user to navigate their way through the web of data by following links from one resource to a related resource. Consequently not every pair of datasets are linked since you can navigate your way through the web of data. However there are many scenarios where you need to know all of the equivalent URIs. For example, to power a linked data integration platform such as the drug discovery platform being developed by the Open PHACTS project [14]. Co-reference services such as sameas.org [11] and BridgeDb [16] provide a look-up service for discovering equivalent identifiers. However, they are a melting pot of equivalence as they do not consider the context of why two things are equivalent. For example, a search on sameas.org with the URI for the UniProt record for “Insulin Receptor (homo-sapien)” 4 results in 30,866 equivalent URIs; the first 1,292 of which are for DBPedia gene entries. Clearly these cannot all be equivalent, particularly since the UniProt record is a protein and not a gene. In order for scientists to trust mappings, they need to understand the context of the equivalence claim and who is making the claim. By providing the context – in scientific terms – together with the provenance of the mapping – how it was made and by whom – the scientist can understand the notion of equivalence captured and make an informed decision about whether to include it in their application. In this paper we – Identify the challenges of identity co-reference across datasets (Section 2); – Discuss the metadata required to describe a dataset and capture the context of its links to other datasets (Section 3); – Present the rules used to control co-reference computation and their usage in the Open PHACTS Identity Management Service (IMS) (Section 4). 2 Multiple Identifiers, but are they the same? Information relevant for drug discovery research is sourced from a variety of overlapping datasets. For example, information about drugs can be retrieved from DrugBank [20], while data about the chemical compounds that compose the drug are available from ChEMBL [10], ChemSpider [21] and DrugBank, and details of the target – typically a protein – that the drug interacts with are available from ChEMBL and UniProt [22]. Since each of these datasets is 4 http://www.uniprot.org/uniprot/P06213 accessed September 2013 Computing Identity Co-Reference 3 modelled with a different focus, and have their own identifier scheme, when can we say that two records are truely equivalent? In some cases it is straightforward. An entry in ChemSpider and an entry in ChEMBL that share the same InChI will report about the same chemical, e.g. “imatinib mesylate”. However, when we consider the drug entry in DrugBank, e.g. “Gleevec”, there can be multiple InChI entries associated, e.g. the “gleevec” entry contains the InChI for both “imatinib” and “imatinib mesylate”. In this case, are the records the “same”. For a scientist interested in “gleevec” they would be, but for someone interested only in “imatinib mesylate” perhaps not. Many datasets contain links to other related datasets. For example, UniProt includes links to several related datasets. However the nature of these links are not captured; in the case of the RDF export of UniProt they are all stated as rdfs:seeAlso. It is therefore hard to automatically reuse such links due to the differing natures of the datasets and meaning of the link. A case in point would be the relationships stated between UniProt – a protein sequence dataset – and Protein Data Bank (PDB) [2] – a 3-dimensional protein structure dataset. Due to the differences in the representations and the data gathering techniques the concepts in these datasets are not, in the strictest sense, equivalent, i.e. there is not a 1:1 isomorphism between the data instances. In particular, the UniProt record for the insulin receptor protein (P062135 ) links to 18 PDB entries. These in turn map back to six UniProt entries; one of which is the insulin receptor protein we started with. For users and applications to trust and reuse equivalence relationships, they need to understand what notion of equivalence is being claimed. UniProt, in their RDF export, weaken their links to rdfs:seeAlso to avoid making inaccurate claims, but this reduces the knowledge conveyed. At the other extreme, the datasets in the Linked Data Cloud tend to be very relaxed about their claims of “equivalence” and widely use, or misuse, the predicate owl:sameAs; typically they do not intend the strict semantics of owl:sameAs. As such, these links need to be used with caution. Such context will enable applications to choose which links to include. For example, for the vast majority of drug discovery research – almost all – it would be acceptable to use “equivalence” links between gene entries in Entrez Gene and protein entries in UniProt as genes are often used as proxies for the protein that they encode. For example, when searching for data about a target, it is common to enter the gene name as the search term. However, there are those who would require that such links are not included, e.g. working in very niche specialisms or on edge cases. 3 Describing Datasets and their Links For effective linking between datasets, and to enable trust in their use in applica- tions, it is essential to understand what has been linked and how. This requires descriptions of the datasets and the links themselves. 5 http://www.uniprot.org/uniprot/P06213 accessed Sept. 2013 4 C.Y.A. Brenninkmeijer et al. Fig. 1: Illustrating example links for a ChemSpider entry from three linksets connecting ChemSpider with DrugBank. Each linkset has a separate publisher using a different link predicate relationship. A dataset description is essential for data discovery and to enable consumers to use the data. It is a means to provide core metadata about a dataset, e.g. its title, description, and license information. It can also convey information about how to access the dataset, e.g. a SPARQL endpoint location, how the data is modelled, i.e. which vocabularies have been used, and key statistics about the data, e.g. number of triples, number of subjects, etc. The Vocabulary of Interlinked Datasets (VoID) [1] provides a vocabulary of terms and a deployment model for dataset descriptions. Where possible, VoID recommends re-using Dublin Core terms, e.g. for providing the title (dct:title) and license (dct:license). VoID itself provides predicates for expressing access and statistical information. A key feature of VoID is the ability to embed the dataset description with the data. This is achieved by the data linking back to its description using the void:inDataset predicate. VoID also introduces the notion of a Linkset which is a collection of links between a pair of datasets. The linkset description captures the context of the links, i.e. which datasets are linked using what predicate. Some benefits of providing separate linksets are that the linksets can develop independently of the datasets and even be provided by third-parties, as well as being used in co-reference services. Figure 1 illustrates four example links for a ChemSpider entry to the DrugBank dataset. These links are drawn from three distinct linksets published by different providers using a variety of link relationships. However, VoID does not prescribe which properties must be provided and those that are more optional. This makes the general re-use of VoID dataset descriptions difficult as there is no guarantee that the information you need will Computing Identity Co-Reference 5 be provided. For example, for the pharmaceutical companies involved with Open PHACTS it is essential to understand the licensing restrictions of the dataset, but this information may not be present in the dataset description. There is also no notion of capturing the version of a dataset, which is essential to know when linking between datasets. To overcome these challenges in the Open PHACTS project, we have defined a checklist of properties that must be provided [13]. We have also identified additional vocabulary terms for capturing the context of the linkset, e.g. the Provenance, Authoring and Version vocabulary (PAV) [6] is used to provide the version number of a dataset. 4 Identity Co-Reference Computation For systems such as the Open PHACTS Discovery Platform (OPSDP) [14], links are required between several datasets. However, it is not practical to require that each pair of datasets is directly related. As such we propose that co-reference identities can be transitively computed from those that are supplied. However, care needs to be taken to avoid computing inaccurate co-references as may result from a chain of links with varying meaning. In this section we detail some alter- native strategies, the problems with them, and outline the approach currently adopted by the Open PHACTS Identity Mapping Service (IMS) [4]. 4.1 Link Predicate A VoID linkset includes three key pieces of information: the dataset that is the subject of the link triples, the dataset that is the object of the link triples, and the predicate used in the links. Based on this information it is feasible to compute transitive co-reference links based on the properties of the link predicate. For example, given the linksets A −−−→ B and B −−−→ C which link the datasets p p A and B, and B and C respectively, with the link predicate p, and let p be the predicate owl:sameAs, then it follows through the properties of the link predicate that we have the linkset A −−−→ C which links datasets A and C. p However, as shown by Halpin et al [15] when owl:sameAs links are used they are often not truely equivalent. Additionally, as indicated in Section 2, many dif- ferent link predicates are in use in life sciences datasets. These predicates each have different properties, e.g. rdfs:seeAlso is neither transitive nor symmetric. As such, it is not possible to compute a complete network of co-reference iden- tifiers across the set of required datasets based on OWL reasoning over the link predicates. Therefore we need a custom approach to computing the transitive co-reference across datasets that requires more than just the link predicate as a means of control. 4.2 Linkset Justification The limitation of the linkset predicate approach stems from the generality of the linking predicate and thus the lack of domain knowledge that it conveys. One 6 C.Y.A. Brenninkmeijer et al. Term Justification Chemical entity The concepts linked represent the same chemical entity. sio:SIO 010004 Gene The concepts linked are conceptually the same gene. sio:SIO 010035 InChI Key The concepts linked have the same InChI Key. cheminf:CHEMINF 000059 Protein The concepts linked are conceptually the same protein. sio:SIO 010043 Protein coding gene A gene resource and a protein resource are being treated sio:SIO 000985 as being equivalent Table 1: A subset of the vocabulary terms used to capture the justification of a linkset and the operational equivalence that is interpreted. sio represents the Semantic Science Integrated Ontology namespace and cheminf the Chemical Information Ontology namespace. approach to overcome this, whilst still retaining the notion of a VoID linkset, would be to mint a new linking predicate for each notion of equivalence; these could be created as sub-properties of existing mapping predicates. However there is a major social barrier to such an approach; gaining consensus on the required linking predicates and updating the existing links in the datasets to use these new link predicates. As such, it is unlikely to gain traction. Another alternative is to annotate the linkset descriptions with additional contextual data; this enables the use of the existing links unchanged. We term this the justification for the linkset; the notion captured is the scientific inter- pretation of the operational equivalence applied by the linkset. For example, two chemical datasets, A and B, that are linked because they have the same InChI string would express this relationship in the linkset VoID header with the triples :A-B_Linkset void:linkPredicate skos:exactMatch ; bdb:linksetJustification cheminf:CHEMINF_000059 . where :A-B_Linkset is the resource that describes the linkset, the link predicate is declared to be skos:exactMatch, and the justification is specified using the BridgeDb vocabulary namespace (bdb6 ) with the value taken from the Chemical Information Ontology namespace (cheminf7 ). The linkset can be expressed as j A −−−→ B where the justification j is cheminf:CHEMINF 000059 and the link p predicate p is skos:exactMatch. The set of supported justifications within the Open PHACTS IMS can be found in [13]; a subset of which are included in Table 1. A key advantage of this approach is that it extends rather then changes the existing data. Based on the justification of linksets, we can compute transitive linksets. For example, we can generate a linkset between datasets A and C through some 6 http://vocabularies.bridgedb.org/ops to appear soon 7 http://semanticscience.org/resource/CHEMINF_000059 accessed Sept. 2013 Computing Identity Co-Reference 7 intermediary dataset B if there is a linkset between A and B and one between B and C such that both linksets have the same justification. Definition 1 formally gives the rule for computing transitive linksets based on their linkset justification. Note that we do not require that the linksets have the same link predicate. The resulting transitive linkset is given the weaker of the two link predicates with a hierarchy of owl:sameAs skos:exactMatch skos:closeMatch rdfs:seeAlso. Thus, if p was the link predicate owl:sameAs and r the link predicate rdfs:seeAlso, j the computed linkset A −−−→ C would have the link predicate rdfs:seeAlso. r Definition 1 (Transitive computation based on linkset justification). j j Given datasets A, B, and C, linksets A −−−→ B and B −−−→ C both with the p r justification j and link predicates p and r respectively then we can generate the linkset j – A −−−→ C if p r; r j – A −−−→ C if r ≺ p. p By iteratively applying the rule given in Definition 1 it is possible to compute chains of linksets that use the same justification. However it is possible to enter an infinite cycle; thus the IMS implementation prevents the same linkset being used more than once in a chain. As part of the provenance of the computed linkset, the linksets that are used to compute it are tracked. 4.3 Permitting Cross-type Equivalence Within the life sciences it is common to use gene names as proxies for protein names since gene names are shorter and more standardised. It is easy for a human to distinguish when this is being done but impossible for a computer to distinguish. A key requirement for the OPSDP is to permit a user to enter with a gene name that is then resolved to a URI for that gene. However, it should be possible to retrieve information about the target – a protein, or group of proteins – for which the entered gene name is a proxy. This means that it must be possible to state that a gene and a protein are operationally equivalent, i.e. to have equivalence across semantic types. This is straightforward using the linkset justification approach. We introduce a new justification for a protein coding gene (sio:SIO 000985), see Table 1. The complication comes when computing the transitive linksets to enable a user entered gene to relate to the protein information in each of the datasets. The transitive computation now needs to support equivalence across semantic types with different justifications. In particular we want to support chains of one or more protein linksets, a protein-gene linkset, and one or more gene linksets. It is important to prevent the use of protein-gene linksets to go from a gene to a 8 C.Y.A. Brenninkmeijer et al. protein and back again; this is to prevent a chain of links whereby we end up with protein X being claimed to be the same as protein Y due to crossing the semantic type boundary multiple times. Additional contextual information is required from the linkset description. Specifically, the semantic type of the data being linked. This needs to be cap- tured at the linkset level since datasets can contain multiple semantic types, e.g. ChEMBL, DrugBank and Ensembl [9]. Two additional predicates are used to capture the semantic type: bdb:subjectsType and bdb:objectsType. Note that these mirror the VoID predicates for specifying the datasets that are linked. Definition 2 extends the co-reference transitive computation rule given in Definition 1 to support cross-type mappings. Note that for simplicity we have omitted the link predicate from the rules in Definition 2. These are derived in the same way as for the rule given in Definition 1. The first clause of Definition 2 is a combination of the clauses in Definition 1, with the additional constraint that all of the datasets involved are of the same semantic type. The second clause allows for a linkset of the same semantic type and a linkset with a cross semantic type justification to be combined, with the resulting linkset being given the cross-type justification. Definition 2 (Cross-type transitive computation). Let A, B, and C be datasets, τ and τ 0 be semantic types with τ 6= τ 0 , and j and j 0 be linkset justifi- cations such that j links datasets with the same type and j 0 links datasets across types. Then the following two rules hold for transitive linkset computation: – Same semantic type and justification j j Aτ −−−→ Bτ ∧ Bτ −−−→ Cτ j Aτ −−−→ Cτ – Cross semantic type and justification j j0 j0 j Aτ −−−→ Bτ ∧ Bτ −−−→ Cτ 0 ∨ Aτ −−−→ Bτ 0 ∧ Bτ 0 −−−→ Cτ 0 j0 Aτ −−−→ Cτ 0 By iteratively applying Definition 2 it is possible to compute the co-reference of URIs across proteins and genes. Note that the definition only permits a single cross-type link justification, e.g. “protein coding gene”, to be used in any chain, although arbitrary numbers of same type justifications, e.g. “same protein” or “same gene”, can be applied. This is due to the consequence of the second rule being given the cross-type link justification. Thus, a chain of links resulting in inaccurate mappings is prevented. 5 Open PHACTS IMS Implementation The Open PHACTS IMS implementation is an extended version of BridgeDb to support cross-references over linked data sources, i.e. supporting the use Computing Identity Co-Reference 9 Fig. 2: Screenshot of the Open PHACTS Identity Mapping Service web interface showing the results for the UniProt entry for insulin receptor. of URIs to represent records in datasets. The source code is available from https://github.com/openphacts/IdentityMappingService and the service is accessible through the Open PHACTS API, https://dev.openphacts.org/. The IMS has implemented the transitive co-reference computation rule given in Definition 2. Computed linksets are created with dataset descriptions giving full details of the datasets linked and the justification for the link. A screenshot of the web interface to the IMS is shown in Figure 2 with the results for a look-up for the UniProt insulin receptor URI. 5.1 Result of Co-reference Computation For the 1.3 release of the OPSDP, the IMS was supplied with 104 linksets from 9 providers linking 37 datasets and containing 7,096,712 links. These are shown in the visualisation in Figure 3a; nodes represent datasets and edges the linksets between them, with the colour signifying the linkset justification. Note that the large number of linksets is a consequence of splitting links based on their semantic type – resulting in multiple linksets between some datasets – and in the case of Ensembl the linksets were further split by species with 12 species covered. The visualisation highlights the Open PHACTS design decision to use a small number of datasets as mapping centres; chemical alignment is performed through the Open PHACTS Chemical Registration Service [19] – OCS in the figure – with a few through the Human Metabolome Database (HMDB) [23] – Hm in the figure – proteins centre around UniProt – S in the figure – and genes around Ensembl – En in the figure. Following the transitive co-reference computation, there are 883 linksets con- taining a total of 17,383,846 links. These are shown in the visualisation in Fig- ure 3b. The left side of the visualisation is dominated by the linksets that match InChIs via the Open PHACTS Chemical Registration Service while the right side 10 C.Y.A. Brenninkmeijer et al. (a) Visualisation showing connectivity of the linksets provided as input to the IMS. (b) Visualisation showing connectivity of the linksets after co-reference computation. Legends Colour Justification URI Red InChI cheminf:CHEMINF 000059 Purple Chemical entity sio:SIO 010004 Green Protein sio:SIO 010043 Orange Gene sio:SIO 010035 Blue Protein coding gene sio:SIO 000985 Light blue Pathway sio:SIO 001107 Yellow Pathway name edam:data 2342 Code Dataset Code Dataset AERS Adverse events reporting system Hm Human metabolite database Bg BioGrid I InterPro Ca Chemical abstracts service Ip International protein index Chebi Chemical entities of biological interest L NCBI Gene Ch.M ChEMBL molecule MGI Mouse genome informatics Ch.T ChEMBL target MSH Medical subject headings Ch.TC ChEMBL target components Pw Pathway ontology Ck KEGG Compound Om Online mendelian inheritance in man Cw ConceptWiki OCS OPS Chemical registration service Cpc PubChem Compound Pd Protein databank Cs ChemSpider Q NCBI Reference Sequence Database D Saccharomyces genome database R Rat genome database Db.D Drugbank drugs S UniProt Db.T Drugbank targets Ug UniGene Em European nucleotide archive Up UniParc En Ensembl Wi Wikipedia F FlyBase Wp Wikipathways Hac HGNC accession number Z Zebrafish information network Fig. 3: Visualisations showing connectivity of the linksets. Computing Identity Co-Reference 11 is dominated by protein coding genes due to the application of the cross-type co-reference rule. The visualisation highlights a second Open PHACTS design decision; to use ConceptWiki8 – Cw in the figure – as the source for text-to-URI translation. This is shown by its connectivity to every other dataset; one of the reasons for computing the co-references. As a consequence of these co-reference computations, users of the OPSDP are able to enter with any chemical, protein, or gene URI known to the IMS and retrieve information about chemicals or targets. 5.2 Evaluation of Co-reference Computation There are several benefits to the co-reference computation approach that has been implemented as part of the OPSDP IMS; not least of which is the increased inter-connectivity across the datasets. First it eases the burden on the dataset providers; with only a small number of additional metadata triples being required to provide justifications and semantic types. This allows systems such as the OPSDP to exploit the fact that many datasets already contain links within their data to instances in other datasets; particularly those who publish their data as RDF. However it cannot be expected that they can link to every dataset that is required for every possible use case. Thus, by applying the co-reference computation we are able to infer additional connectivity across the data. Second the co-reference computation tightly controls what can be equated, e.g. only chemical entities can be related through an InChI, and which justi- fications are allowed to cross semantic types, e.g. protein coding gene relating genes and proteins. These safeguards ensure that the result of the co-reference computation matches the expectations of the domain scientists. User evaluation of the results of the co-reference computation is on-going with domain scientists. The IMS has been successfully deployed by the OPSDP enabling the integration of data across the data sources shown in Figure 3. 6 Related Work Identifiers.org [18] provide a linked data identifier for many life sciences datasets. This consists of a URI constructed according to the rules of the identifier scheme of the underlying dataset. However, they do not attempt to identify co-referent identifiers. The Identifiers.org approach is complimentary to the co-reference work reported here. The IMS accepts and returns the Identifiers.org form of URI for each of the datasets. Bio2RDF [5] is another closely related approach. Bio2RDF republishes ex- isting datasets as RDF where the source data has been originally published as database dumps or in other formats. Where the original datasets contain links to other datasets these are published with the RDF. These links could be used 8 http://ops.conceptwiki.org/ accessed September 2013 12 C.Y.A. Brenninkmeijer et al. as a source of mapping information for the IMS and the IMS already returns the Bio2RDF identifier as an alternative URI. BridgeDb [16] and sameas.org [11] are co-reference services similar to the IMS. The goal of sameas.org is to ingest the links of as many linked data sources as possible as such it has a much broader coverage of topics than the IMS. It provides an API for returning all known equivalent URIs. However, there is no curation of the links nor context for the data. Additionally, to the best of our knowledge, there is no transitive computation across the co-referent URIs. BridgeDb is a life sciences focused database identity cross-reference service. How- ever, it does not track the equivalent URIs for the database entries. It also does not characterise the database cross-reference nor their context. The IMS is an extension of BridgeDb to provide a URI look-up service that understands the context of the links as well as the linking relationship. Another closely related area of research is focused on generating tools for identifying links between pairs of datasets; either at the schema level Ontology Matching or at the instance level. The latter of these is most relevant to the work in this paper. Since 2009 there has been an instance matching track9 in the an- nual ontology matching competition10 to compare such tools. The most recent re- sults are available from http://www.instancematching.org/oaei/imei2013/ results.html. The links generated by these instance matching tools could be used as input to the IMS. 7 Conclusions In this paper we presented the challenges for co-reference across life sciences datasets that stem from each dataset using their own identifier scheme. We have argued that there is not a one size fits all notion of equivalence across concepts in these different datasets since they model the data at different levels of granularity, e.g. should a drug entry be equated to an entry about the chemical compound. Additionally, users of the data want to apply varying notions of equivalence based on the task they are performing, e.g. should genes and proteins always be equivalent. As such we have proposed that the notion of operational equivalence should be captured in the linksets that relate a pair of datasets as the justification for the linkset. The advantage of stating it as a justification rather than a mapping predicate is that existing linksets can be easily extended. Due to the fact that each pair of datasets is not related in the web of data, we have developed rules for transitively computing co-reference across datasets. To support scenarios where genes and proteins should be equated, the co-reference computation allows the crossing of semantic types. We presented a rule for pre- venting undesired co-references being computed whilst ensuring that concepts within a given type are completely covered. This has been implemented in the 9 http://www.instancematching.org/oaei/imei2013/results.html accessed November 2013 10 http://oaei.ontologymatching.org/ accessed November 2013 Computing Identity Co-Reference 13 Open PHACTS Identity Mapping Service. Details of how the IMS is used within the OPSDP can be found in [14,4]. As future work we will allow applications to apply different scientific lenses [3] over the co-reference network to vary the notion of operational equivalence being applied, i.e. to activate different combinations of linksets based on their justification. These lenses depend upon the justifications used to compute the co-references. Acknowledgements The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking under grant agreement number 115191, resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/2007- 2013) and EFPIA compa- nies’ in kind contribution. References 1. Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing Linked Datasets with the VoID Vocabulary. Note, W3C (Mar 2011), http://www.w3.org/TR/void/ 2. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E.: The Protein Data Bank. Nucleic acids research 28(1), 235–42 (Jan 2000), http://www.pubmedcentral.nih.gov/articlerender. fcgi?artid=102472&tool=pmcentrez&rendertype=abstract 3. Brenninkmeijer, C.Y.A., Evelo, C., Goble, C., Gray, A.J.G., Groth, P., Pettifer, S., Stevens, R., Williams, A.J., Willighagen, E.L.: Scientific Lenses over Linked Data: An approach to support task specific views of the data. A vision. In: Proceedings of 2nd International Workshop on Linked Science 2012 (LISC2012) Colocated 11th International Semantic Web Conference 2012. CEUR-WS.org, Boston, MA, USA (2012), http://ceur-ws.org/Vol-951/paper5.pdf 4. Brenninkmeijer, C.Y.A., Goble, C., Gray, A.J.G., Groth, P., Loizou, A., Pettifer, S.: Including Co-referent URIs in a SPARQL Query. In: 4th International Workshop on Consuming Linked Data. Sydney, Australia (Jul 2013) 5. Callahan, A., Cruz-toledo, J., Ansell, P., Dumontier, M.: Bio2RDF Release 2 : Im- proved Coverage , Interoperability. In: ESWC 2013. pp. 200–212. Springer, Mont- pellier, France (2013) 6. Ciccarese, P., Soiland-Reyes, S., Belhajjame, K., Gray, A.J.G., Goble, C., Clark, T.: PAV ontology: Provenance, Authoring and Versioning. arXiv.org (Apr 2013), http://arxiv.org/abs/1304.7224, submitted to Journal of Biomedical Semantics 7. Clark, T., Martin, S., Liefeld, T.: Globally distributed object identification for biological knowledgebases. Briefings in bioinformatics 5(1), 59–70 (Mar 2004), http://www.ncbi.nlm.nih.gov/pubmed/15153306 8. Fernández-Suárez, X.M., Galperin, M.Y.: The 2013 Nucleic Acids Research Database Issue and the online molecular biology database collection. Nucleic acids research 41(Database issue), D1–7 (Jan 2013), http://nar.oxfordjournals.org/ content/early/2012/11/30/nar.gks1297 14 C.Y.A. Brenninkmeijer et al. 9. Flicek, P., Ahmed, I., Amode, M.R., Barrell, D., Beal, K., Brent, S., Carvalho- Silva, D., Clapham, P., Coates, G., Fairley, S., Fitzgerald, S., Gil, L., Garcı́a- Girón, C., Gordon, L., Hourlier, T., Hunt, S., Juettemann, T., Kähäri, A.K., Keenan, S., Komorowska, M., Kulesha, E., Longden, I., Maurel, T., McLaren, W.M., Muffato, M., Nag, R., Overduin, B., Pignatelli, M., Pritchard, B., Pritchard, E., Riat, H.S., Ritchie, G.R.S., Ruffier, M., Schuster, M., Sheppard, D., Sobral, D., Taylor, K., Thormann, A., Trevanion, S., White, S., Wilder, S.P., Aken, B.L., Birney, E., Cunningham, F., Dunham, I., Harrow, J., Herrero, J., Hubbard, T.J.P., Johnson, N., Kinsella, R., Parker, A., Spudich, G., Yates, A., Zadissa, A., Searle, S.M.J.: Ensembl 2013. Nucleic acids research 41(Database issue), D48–55 (Jan 2013), http://www.pubmedcentral.nih.gov/articlerender.fcgi? artid=3531136&tool=pmcentrez&rendertype=abstract 10. Gaulton, A., Bellis, L.J., Bento, a.P., Chambers, J., Davies, M., Hersey, A., Light, Y., McGlinchey, S., Michalovich, D., Al-Lazikani, B., Overington, J.P.: ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic acids research 40(Database issue), D1100–7 (Jan 2012), http://www.pubmedcentral.nih.gov/ articlerender.fcgi?artid=3245175&tool=pmcentrez&rendertype=abstract 11. Glaser, H., Jaffri, A., Millard, I.: Managing Co-reference on the Semantic Web. In: WWW2009 Workshop: Linked Data on the Web (LDOW2009). Madrid, Spain (Apr 2009) 12. Goble, C., Stevens, R.: State of the nation in data integration for bioinformatics. Journal of biomedical informatics 41(5), 687–93 (Oct 2008), http://www.ncbi. nlm.nih.gov/pubmed/18358788 13. Gray, A.J.G.: Dataset descriptions for the Open Pharmacological Space. Working draft, Open PHACTS (Sep 2013), http://www.openphacts.org/specs/datadesc 14. Gray, A.J.G., Groth, P., Loizou, A., Askjaer, S., Brenninkmeijer, C.Y.A., Burger, K., Chichester, C., Evelo, C.T., Goble, C.A., Harland, L., Pettifer, S., Thomp- son, M., Waagmeester, A., Williams, A.J.: Applying linked data approaches to pharmacology: Architectural decisions and implementation. Semantic Web (2014), http://iospress.metapress.com/index/J3J12776V103821U.pdf 15. Halpin, H., Hayes, P.J., McCusker, J.P., McGuinness, D.L., Thompson, H.S.: When owl:sameAs Isn’t the Same: An Analysis of Identity in Linked Data. In: Inter- national Semantic Web Conference (1). LNCS, vol. 6496, pp. 305–320. Springer, Shanghai, China (Nov 2010) 16. van Iersel, M.P., Pico, A.R., Kelder, T., Gao, J., Ho, I., Hanspers, K., Con- klin, B.R., Evelo, C.T.: The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services. BMC Bioinformatics 11(5) (Jan 2010), http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 2824678&tool=pmcentrez&rendertype=abstract 17. Jupp, S., Malone, J., Bolleman, J., Brandizi, M., Davies, M., Garcia, L., Gaulton, A., Gehant, S., Laibe, C., Redaschi, N., Wimalaratne, S., Martin, M., Le Novere, N., Parkinson, H., Birney, E., Jenkinson, A.: The EBI RDF platform: Linked open data for the life sciences. Bioinformatics Application Note (2014), accepted for publication November 2013 18. Juty, N., Le Novère, N., Laibe, C.: Identifiers.org and MIRIAM Registry: community resources to provide persistent identification. Nucleic acids re- search 40(Database issue), D580–6 (Jan 2012), http://nar.oxfordjournals.org/ content/40/D1/D580 19. Karapetyan, K., Tkachenko, V., Batchelor, C., Sharpe, D., Williams, A.J.: Rsc chemical validation and standardization platform: A potential path to quality- Computing Identity Co-Reference 15 conscious databases. In: 245th American Chemical Society National Meeting and Exposition. New Orleans, LA, USA (April 2013) 20. Knox, C., Law, V., Jewison, T., Liu, P., Ly, S., Frolkis, A., Pon, A., Banco, K., Mak, C., Neveu, V., Djoumbou, Y., Eisner, R., Guo, A.C., Wishart, D.S.: DrugBank 3.0: a comprehensive resource for ’omics’ research on drugs. Nucleic acids research 39(Database issue), D1035–41 (Jan 2011), http://www.pubmedcentral.nih.gov/ articlerender.fcgi?artid=3013709&tool=pmcentrez&rendertype=abstract 21. Pence, H.E., Williams, A.J.: ChemSpider: an online chemical information resource. Journal of Chemical Education 87(11), 10–11 (2010), http://pubs.acs.org/doi/ abs/10.1021/ed100697w 22. The UniProt Consortium: Update on activities at the Universal Protein Re- source (UniProt) in 2013. Nucleic acids research 41(Database issue), D43– 7 (Jan 2013), http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 3531094&tool=pmcentrez&rendertype=abstract 23. Wishart, D.S., Tzur, D., Knox, C., Eisner, R., Guo, A.C., Young, N., Cheng, D., Jewell, K., Arndt, D., Sawhney, S., Fung, C., Nikolai, L., Lewis, M., Coutouly, M.A., Forsythe, I., Tang, P., Shrivastava, S., Jeroncic, K., Stothard, P., Amegbey, G., Block, D., Hau, D.D., Wagner, J., Miniaci, J., Clements, M., Gebremedhin, M., Guo, N., Zhang, Y., Duggan, G.E., Macinnis, G.D., Weljie, A.M., Dowlatabadi, R., Bamforth, F., Clive, D., Greiner, R., Li, L., Marrie, T., Sykes, B.D., Vogel, H.J., Querengesser, L.: HMDB: the Hu- man Metabolome Database. Nucleic acids research 35(Database issue), D521– 6 (Jan 2007), http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid= 1899095&tool=pmcentrez&rendertype=abstract