=Paper=
{{Paper
|id=None
|storemode=property
|title=Improved Dataset Coverage and Interoperability with Bio2RDF Release 2
|pdfUrl=https://ceur-ws.org/Vol-952/paper_18.pdf
|volume=Vol-952
|dblpUrl=https://dblp.org/rec/conf/swat4ls/CallahanCAKTD12
}}
==Improved Dataset Coverage and Interoperability with Bio2RDF Release 2==
<pdf width="1500px">https://ceur-ws.org/Vol-952/paper_18.pdf</pdf>
<pre>
         Improved dataset coverage and interoperability with
                        Bio2RDF Release 2

    Alison Callahan1, Jose Cruz-Toledo1, Peter Ansell2, Dana Klassen3, Giovanni Tum-
                            marello4 and Michel Dumontier1§
1
 Department of Biology, Carleton University, Ottawa, Canada, 2Microsoft QUT eResearch
Centre, Queensland University of Technology, Australia, 3Digital Enterprise Research Institute,
National University of Ireland, Galway, Ireland 4SindiceTech, Galway, Ireland
                                    §
                                     Corresponding author


         Abstract. Bio2RDF is an open source project that uses Semantic Web technol-
         ogies to create and provide the largest network of Linked Data for the life
         sciences. Here, we present the second release of the Bio2RDF project which
         features updated, open-source scripts, a resource registry for IRI mapping and
         normalization, dataset provenance, data metrics, downloadable RDF data files
         and Virtuoso SPARQL endpoints. We describe dataset connectivity, assisted
         SPARQL queries with context-aware SPARQLed, and mashup capability using
         the Sig.ma search engine. We discuss updates to the Bio2RDF project in the
         context of other related resources as well as future improvements.

         Keywords: semantic web, linked data, life sciences data, SPARQL


1        Introduction


In this post-genomic-information era, biological researchers are often confronted with
the inevitable and unenviable task of having to integrate their experimental results
with those of others. This task usually involves a tedious manual search and assimila-
tion of often isolated and diverse collections of life sciences data hosted by multiple
independent providers including organizations such as the National Center for Bio-
technology Information (NCBI) 1 and the European Bioinformatics Institute (EBI) 2
which provide dozens of user-submitted and curated data, as well as smaller institu-
tions such as the Donaldson group which publishes iRefIndex[1], a database of mole-
cular interactions aggregated from 13 data sources. While these mostly isolated silos
of biological information occasionally provide links between their records (e.g. Uni-
Prot links its entries to hundreds of other databases 3), they are typically serialized in
either HTML tags or in flat file data dumps that lack the semantic richness required to
serialize the intent of the linkage between data records. With thousands of biological
1
  http://www.ncbi.nlm.nih.gov/
2
  http://www.ebi.ac.uk/
3
  http://www.uniprot.org/database/
databases4,5 and hundreds of thousands if not millions of datasets, our ability to find
relevant data is hampered by non-standard database interfaces and an enormous num-
ber of haphazard data formats[2]. Moreover, metadata about these biological data
providers (dataset source data information, dataset versioning, licensing information,
date of creation, etc.) is often difficult to obtain. Taken together, our inability to easily
navigate through available data presents an overwhelming barrier to their reuse.
         Bio2RDF is an open source project that uses Semantic Web technologies to
make possible the distributed querying of integrated life sciences data. Since its incep-
tion[3], Bio2RDF has made use of the Resource Description Framework (RDF) and
the RDF Schema (RDFS) to unify the representation of data obtained from diverse
(molecules, enzymes, pathways, diseases, etc.) and heterogeneously formatted biolog-
ical data (e.g. flat-files, tab-delimited files, SQL, dataset specific formats, XML etc.).
Once converted to RDF, this biological data can be queried using the powerful
SPARQL Protocol and RDF Query Language (SPARQL), which can be used to fede-
rate queries across multiple SPARQL-compliant databases (a.k.a. SPARQL end-
points).
         Although several efforts for provisioning linked life data exist such as Neu-
rocommons[4], LinkedLifeData[5], W3C HCLS 6, Chem2Bio2RDF[6] and BioLOD7,
Bio2RDF stands out for several reasons: i) Bio2RDF is open source and freely availa-
ble to use, modify or redistribute, ii) it acts on a set of basic guidelines to produce
syntactically interoperable linked data across all datasets, iii) does not attempt to mar-
shal data into a single global schema, iv) provides a federated network of SPARQL
endpoints and v) provisions the community with an expandable global network of
mirrors that host Bio2RDF datasets.
        Here we present Bio2RDF Release 2, a significant update from past practice
that considerably increases the level of syntactic interoperability across datasets
through a script-directed IRI normalization that queries a central dataset registry. We
also introduce a new model for data item-level provenance and describe new metrics
for linked datasets that guide querying and provide high-level descriptions of datasets.
We characterize dataset connectivity, assisted SPARQL queries with context-aware
SPARQLed, and data mash-up capability using the Sig.ma 8 search engine.


2      Methods

2.1    Resource Registry
A resource registry composed of vocabularies (e.g. Gene Ontology, ChEBI, etc.) and
datasets (e.g. RefSeq) was developed to facilitate dataset identification and inter-
dataset mapping. Each item lists a preferred short name (a.k.a. namespace; e.g. „pdb‟

4
  http://nar.oxfordjournals.org/content/40/D1.toc
5
  http://www.freebase.com/view/base/bio2rdf/views/bm
6
  http://www.w3.org/blog/hcls/
7
  http://biolod.org/
8
  http://sig.ma/
for the Protein DataBank), resource synonyms (e.g. ncbigene, entrez gene, entrez-
gene/locuslink for the NCBI‟s Gene database), as well as primary and secondary base
Internationalized Resource Identifiers (IRIs) used within the datasets
(e.g.http://purl.obolibrary.org/obo/,                        http://purl.org/obo/owl/,
http://purl.obofoundry.org/namespace, etc). The resource registry is currently availa-
ble as part of the PHP-LIB project 9.

2.2       Identifiers
Bio2RDF data items are identified by formulating an Internationalized Resource Iden-
tifier (IRI) consisting of the following pattern:

                            http://bio2rdf.org/namespace:identifier

where „namespace‟ is the preferred short name of a biological dataset as found in the
resource registry (section 2.1) and the „identifier‟ is the unique string used by the
source provider. For example, the Protein DataBank (PDB) features a structure con-
taining an adenine riboswitch complex, which it identifies by the accession “1Y26”.
In the registry, the PDB is assigned the namespace “pdb” and thus, its corresponding
Bio2RDF IRI is

                                  http://bio2rdf.org/pdb:1Y26

Two additional identifier patterns are used for resources introduced as a product of
RDFization. First, namespace_vocabulary:identifier, is used to name dataset-specific
types and predicates. For example, the chemoinformatics resource DrugBank contains
data about drugs and their targets, and these two types have the following IRIs:

                         http://bio2rdf.org/drugbank_vocabulary:Drug
                        http://bio2rdf.org/drugbank_vocabulary:Target

The second namespace pattern, namespace_resource:identifier, is used to designate
additional resources that were introduced to convert (unidentified) n-ary relations into
an identified object with a set of binary relations. For example, the Pharmacogenom-
ics Knowledge Base (PharmGKB) describes associations between diseases, genes and
drugs, but does not specify an identifier for either of these associations, and hence we
assign a new stable identifier for each, such as

           http://bio2rdf.org/pharmgkb_resource:association_PA445019_PA126

for the gene-disease association between cytochrome P450, family 2, subfamily C,
polypeptide 9 (pharmgkb:PA126) and Myocardial Infarction (pharmgkb:PA445019).


9
    https://github.com/micheldumontier/php-lib/blob/master/ns.php
2.3    Bio2RDF’s Open Scripts
At its core, Bio2RDF is a set of conventions to generate and provide Linked Data.
These best practices have been inspired by the Banff Manifesto 10, Tim Berner-Lee‟s
design principles 11 and the collective experience of the Bio2RDF community. In
2012, we consolidated the set Bio2RDF open source12 scripts into a single GitHub
repository (bio2rdf-scripts) 13 , which facilitates collaborative development through
project forking, pull requests, code commenting, and merging. Thirty PHP scripts,
one Java program and a Ruby gem are now available for any use (including commer-
cial), modification and redistribution by anyone wishing to generate RDF data on
their own, or to improve the quality of RDF conversions currently used in Bio2RDF.
          Nearly every script has now been updated to make use of the resource regi-
stry, thereby ensuring a high level of syntactic interoperability between the generated
linked data sets. Scripts that have not yet been updated include the NCBO Bioportal
collection, GenBank and RefSeq. These transformation scripts are programmatically
restricted to only create valid Bio2RDF resources and only make use of preferred
namespace items in a dataset as found in our resource registry.


2.4    Provenance
Previous iterations of Bio2RDF scripts lacked a framework with which to record
provenance (metadata about the creator, creation date and origin) for Bio2RDF data-
sets. Upon execution, Bio2RDF scripts now generate provenance records using the
W3C Vocabulary of Interlinked Datasets (VoID), the Provenance vocabulary (PROV)
and Dublin Core vocabulary. Each data item is linked to a provenance object that
indicates the source of the data, the time at which the RDF was generated, licensing
(if available from data source provider), the SPARQL endpoint in which the resource
can be found, and the downloadable RDF file where the data item is located. Each
dataset provenance object has a unique IRI and label based on the dataset name and
creation date. The date-specific dataset IRI is linked to a unique dataset IRI using the
W3C PROV predicate „wasDerivedFrom‟ such that one can query the dataset
SPARQL endpoint to retrieve all provenance records for datasets created on different
dates. Figure 1 shows an example provenance record for the NLM Medical Subject
Headings (MeSH) dataset. Each resource in the dataset is linked the date-unique data-
set IRI that is part of the provenance record using the VoID „inDataset‟ predicate.
Other important features of the provenance record include the use of the Dublin Core
„creator‟ term to link a dataset to the script on Github that was used to generate it, the
VoID predicate „sparqlEndpoint‟ to point to the dataset SPARQL endpoint, and VoID
predicate „dataDump‟ to point to the data download URL.


10
   https://sourceforge.net/apps/mediawiki/bio2rdf/index.php?title=Banff_Manifesto
11
   http://www.w3.org/DesignIssues/LinkedData.html
12
   http://opensource.org/licenses/MIT
13
   http://github.com/bio2rdf/bio2rdf-scripts
Figure 1 Example provenance record for the MeSH dataset


2.5   SPARQL Endpoints


Each dataset was loaded into a separate instance of OpenLink Virtuoso Community
Edition build 06.01.3127 with the faceted browser, SPARQL 1.1 query federation and
Cross-Origin Resource Sharing enabled.


2.6   Dataset metrics

Dataset metrics provide an important overview of dataset contents, which can be used
to support query formulation or monitor changes to datasets over time. We apply
three different dataset metrics programs (A-C below) to each dataset. These metrics
are serialized as RDF and loaded into their own graphs at each dataset SPARQL end-
point.
A) Nine dataset metrics are computed14 using SPARQL queries that obtain the follow-
ing information
 1. total number of triples
 2. number of unique subjects
 3. number of unique predicates
 4. number of unique objects
 5. number of unique types
 6. unique predicate-object links and their frequencies
 7. unique predicate-literal links and their frequencies
 8. unique subject type-predicate-object type links and their frequencies
 9. unique subject type-predicate-literal links and their frequencies

B) Namespace-related metrics are tabulated including

 1. total number of references to a namespace
 2. total number of inter-namespace references
 3. total number of inter-namespace-predicate references

C) Data graph summaries[7] required for query formulation using SparQLed 15 are
generated. The data graph summaries include metrics regarding the frequency and
relationship among types via predicates. The data graph summaries are serialized in
RDF using the Dataset Analytics Vocabulary16.


3       Results

3.1     Bio2RDF Release 2
Nineteen datasets, including 5 new datasets, were generated as part of the Bio2RDF 2
release (Table 1). Several of the new datasets are themselves collections of datasets
that are now available as one resource. For instance, iRefIndex consists of 13 datasets
(BIND, BioGRID, CORUM, DIP, HPRD, InnateDB, IntAct, MatrixDB, MINT,
MPact, MPIDB, MPPI and OPHID) while NCBO‟s Bioportal collection currently
consists of 100 OBO ontologies including ChEBI, Protein Ontology and the Gene
Ontology. We also have 10 additional updated scripts that are currently generating
updated datasets and SPARQL endpoints to be available with the next release: Uni-
Prot (including UniRef and UniParc), UniSTS, PubMed, PDB, RefSeq, PubChem,
ChemBL, DBPedia, GenBank, MGI and PathwayCommons. Several of these datasets
are the most resource intensive to generate and load, hence their later release sche-
dule.
          Each dataset has been loaded into a dataset specific SPARQL endpoint using
Openlink Virtuoso version 6.1.6. SPARQL endpoints are available at

14
   https://github.com/bio2rdf/bio2rdf-scripts/blob/master/statistics/bio2rdf_stats_virtuoso.php
15
   https://github.com/sindicetech/sparqled
16
   http://vocab.sindice.net/analytics#
http://[namespace].bio2rdf.org. For example, the Saccharomyces Genome Database
(SGD) SPARQL endpoint is available at http://sgd.bio2rdf.org. All updated Bio2RDF
linked data and their corresponding Virtuoso DB files are available for download at
http://download.bio2rdf.org. Pre-Release 2 Bio2RDF datasets are also available for
download.

 Table 1. Bio2RDF Release 2 datasets and selected dataset metrics. Dataset names annotated
                        with * are new to the Bio2RDF network.

Dataset                Namespace    # of triples   # of unique    # of unique     # of unique
                                                   subjects       predicates      objects
Affymetrix             affymetrix    44469611        1370219           79          13097194
Biomodels*             biomodels       589753         87671            38           209005
Comparative Tox-       ctd           141845167      12840989           27          13347992
icogenomics Data-
base
DrugBank               drugbank       1121468         172084           75           526976
NCBI Gene              ncbigene      394026267      12543449           60         121538103
Gene       Ontology    goa           80028873        4710165           28          19924391
Annotations
HUGO Gene No-          hgnc            836060         37320            63           519628
menclature Com-
mittee
Homologene             homologene     1281881          43605           17           1011783
InterPro*              interpro        999031          23794           34            211346
iProClass              iproclass     211365460      11680053           29          97484111
iRefIndex              irefindex     31042135        1933717           32           4276466
Medical      Subject   mesh           4172230         232573           60           1405919
Headings
National Center for    ncbo           15384622       4425342          191          7668644
Biomedical Ontol-
ogy*
National       Drug    ndc            17814216        301654           30           650650
Code Directory*
Online Mendelian       omim           1848729         205821           61          1305149
Inheritance in Man
Pharmacogenomics       pharmgkb       37949275       5157921           43          10852303
Knowledge Base
SABIO-RK*              sabiork        2618288         393157           41           797554
Saccharomyces          sgd            5551009         725694           62          1175694
Genome Database
NCBI Taxonomy          taxon         17814216        965020            33          2467675
Total                  19           1010758291      57850248          1003        298470583
3.2    Namespace-based dataset connectivity
Figure 2 shows the connectivity between Bio2RDF datasets based on namespace-
namespace linkages. Highlighted are core Bio2RDF datasets that make reference to
hundreds of other datasets.


Figure 2 A network-based visualization of Bio2RDF namespace connectivity. Selected nodes
indicate Bio2RDF datasets, as identified from provenance descriptions. Figure produced using
IBM‟s Many Eyes (http://www-958.ibm.com).


3.3    Metrics-informed querying
Dataset metrics (section 2.6) serve as an overview of the contents of a dataset and can
be used to guide querying with SPARQL. Table 2 shows values for the type-relation-
type metric in the DrugBank dataset. In the first row we observe that 11,512 unique
pharmaceuticals are paired with 56 different units using the „form‟ predicate, indicat-
ing the enormous number of possible formulations. Further in the list, we see that
1074 unique drugs are involved in 10891 drug-drug interactions, most of these arising
from FDA drug product labels.
    Table 2. Selected DrugBank dataset metrics describing the frequencies of type-relation-type
           occurrences. The namespace for subject types, predicates, and object types is
                            „http://bio2rdf.org/drugbank_vocabulary:‟
                                                                                         Object
 Subject Type            Subject Count      Predicate           Object Type              Count
 Pharmaceutical                  11512      form                Unit                         56
 Drug-Transporter-
 Interaction                        1440    drug                Drug                         534
 Drug-Transporter-
 Interaction                        1440    transporter         Target                       88
 Drug                               1266    dosage              Dosage                      230
 Patent                             1255    country             Country                       2
 Drug                               1127    product             Pharmaceutical            11512
 Drug                               1074    ddi-interactor-in   Drug-Drug-Interaction     10891
 Drug                                532    patent              Patent                     1255
 Drug                                277    mixture             Mixture                    3317
 Dosage                              230    route               Route                        42
 Drug-Target-
 Interaction                          84    target              Target                        43

The type-relation-type metric gives the necessary information to understand how
objects are related to one another in the RDF graph. It can also inform the construc-
tion of an immediately useful SPARQL query, without losing time generating „explo-
ratory‟ queries to become familiar with the dataset model. For instance, the above
table suggests that in order to retrieve drugs that are involved in drug-drug interac-
tions, one should specify the „ddi-interactor-in‟ predicate, to link a drug to its drug-
drug interaction(s):

PREFIX drugbank_vocabulary: <http://bio2rdf.org/drugbank_vocabulary:>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?ddi ?d1name
WHERE {
           ?ddi a drugbank_vocabulary:Drug-Drug-Interaction .
           ?d1 drugbank_vocabulary:ddi-interactor-in ?ddi .
           ?d1 rdfs:label ?d1name?.
           ?d2 drugbank_vocabulary:ddi-interactor-in ?ddi .
           ?d2 rdfs:label ?d2name.
           FILTER (?d1 != ?d2)
}

Some of the results of this query are listed in Table 3.
  Table 3. Partial and collated results from a query to obtain drug-drug interactions from the
                            Bio2RDF DrugBank SPARQL endpoint

Drug-Drug Interaction                            DDI Drug Participants
drugbank_resource:DB00001_DB01381                Ginkgo biloba, Lepirudin
drugbank_resource:DB00008_DB01223                Peginterferon alfa-2a, Aminophylline
drugbank_resource:DB00013_DB01404                Ginseng, Urokinase
drugbank_resource:DB00015_DB00208                Reteplase, Ticlopidine
drugbank_resource:DB00021_DB01409                Tiotropium, Secretin
drugbank_resource:DB00031_DB00055                Drotrecoginalfa, Tenecteplase
drugbank_resource:DB00041_DB01013                Aldesleukin, Clobetasol
drugbank_resource:DB00047_DB00195                Betaxolol, Insulin Glargine
drugbank_resource:DB00054_DB00775                Tirofiban, Abciximab
drugbank_resource:DB00059_DB00072                Trastuzumab, Betamethasone


3.4    Context-Aware SPARQL assistance with SPARQLed

SPARQLed is an open-source web-application that provides context sensitive IRI
suggestions while formulating SPARQL queries. In particular, once a variable has
been linked to a predicate or type, it is possible to deduce which other relations or
types are applicable based on the inferred position of the object in the type-relation-
type graph. Figure 3 shows the grammar-sensitive and context aware formulation of a
query to retrieve drug-gene associations from PharmGKB, where once the variable ?s
is restricted to Drug-Gene-Association (Figure 3A) the only predicates available to
use are listed in the suggestion box. Completion of the query (Figure 3B) to obtain the
drug and gene names yields the results in Figure 3C.
Figure 3 Using SPARQLed context-aware SPARQL assisted querying. (A) Selecting ctrl-shift
space shows available predicates for a subject that has been constrained to a PharmGKB drug-
gene association. (B) A SPARQLed-assisted query to get the drug and gene name. (C) First
four drug-gene associations from the query in (B).


3.5    Virtuoso faceted search and query builder
By default, Virtuoso comes with a faceted browser that facilitates search and querying
across a single SPARQL endpoint. The faceted search is initialized with a keyword
(e.g. “drugbank” against the DrugBank endpoint – which appears in the rdfs:label of
every drugbank resource). The search identifies 170,336 page-ranked hits that can be
further categorized by type by selecting “Types” in the Entity Relations Navigation
panel. The results include 32 types including drugs, drug interactions, targets, experi-
mental and computed properties (Figure 4). Selecting any one of these will provide a
list of specific instances of those types.


Figure 4 Types matching a search of “drugbank” on the DrugBankVirtuoso endpoint.

However, the Virtuoso Faceted Search is significantly more powerful than just a
search and navigation tool- it facilitates the iterative construction of an increasingly
sophisticated query. For example, to determine the most popular target in DrugBank,
first select the“attributes” link, which provides a list of predicates, including the drug-
bank_vocabulary:target, which points to DrugBank Targets. Selecting this predicate
displays a list which can then be aggregated using “Distinct values (Aggregated)” to
rank the targets by the number of entities that link to it using the „drug-
bank_vocabulary:target‟ predicate. Figure 5 shows that cell division protein kinase 2
is the highest referenced target (270 times) in DrugBank. Selecting “Entity1” in the
top part of the query builder then shows the 270 drugs that target this enzyme, as well
as the option to view the SPARQL query behind the faceted search and get a perma-
link to the facet.


Figure 5 A count-ranked list of the attributes for all drug-target interactions.


3.6       Sig.ma powered mashups
          17
Sig.ma is an online browser that enables the mashup of data from one or more on-
line resources (REST APIs, SPARQL endpoints, etc) using a keyword based search.
We set up an instance of sig.ma to point to three endpoints (PharmGKB, DrugBank,
NDC) and searched for „aspirin‟. What is returned (Figure 6) is a mash-up of all
resources that have “aspirin” in the rdfs:label, which is evident from the set of 23
labels and 8 types from the 3 endpoints (DrugBank: drug-drug interactions, pharma-
ceutical, side-effect; NDC: ingredient, substance, product and human OTC;
PharmGKB: chemical).While having all the labels listed together is an unusual UI
design, each attribute is linked to its source data item. By “approving” a source item,
and hiding all the others, it becomes possible to see a single entry (Figure 7).


17
     http://sig.ma/
    Figure 6 Sig.ma search with "aspirin" over PharmGKB, DrugBank and NDC


    Figure 7 View of a single entry from the sig.ma mashup


4       Discussion and Conclusions

Bio2RDF Release 2 features updates to data conversion scripts, datasets and functio-
nality. The use of GitHub as an open software development environment makes it
possible for enthusiasts to contribute new code and make improvements and sugges-
tions to existing code. We welcome those that think Bio2RDF could be useful to their
projects to contact us on the mailing list and participate in the development team.
          The use of a Bio2RDF resource registry in each script will ensure that all
Bio2RDF IRIs are in fact using validated namespaces (resource short names). Impor-
tantly, the addition of synonyms means that scripts can now map infrequently used or
unusual database names and IRIs to a canonical Bio2RDF IRI. Our effort to develop a
consistent registry of datasets and namespaces follows in the footsteps of our large
scale aggregated namespace directory. Importantly, we have provided this directory to
the maintainers of identifiers.org to be incorporated into the MIRIAM registry [8]
which powers it. Once we have merged our resource listings, we expect to make di-
rect use of the MIRIAM registry to list new entries, and to have identifiers.org list
Bio2RDF as a resolver for most of its entries. Moreover, since the MIRIAM registry
describes regular expressions that specify the identifier pattern, Bio2RDF scripts will
be able to check whether an identifier is valid for a given namespace, thereby improv-
ing the quality of data produced by Bio2RDF scripts.
        While we have described how dataset metrics are useful to summarize the
RDF graph, and can be used to facilitate the construction of SPARQL queries as ex-
emplified by the SPARQLed tool, we anticipate that these metrics will also be fun-
damentally useful in monitoring dataset flux. Users will no longer need to perform
expensive queries over Bio2RDF endpoints to assess changes or updates to data as the
relevant information (such as total number of triples, number of records of a given
type, type-type relations etc.) is available in the pre-computed metrics, which will be
generated with each data release and recorded as a „snapshot‟ of the dataset at crea-
tion time. This is particularly timely, as recent efforts at the 2012 BioHackathon in
Japan yielded an effort to assess the “sparkliness” of SPARQL endpoints18 and to
monitor their uptime. The dataset metrics also make it possible to assess the growth of
datasets over time, in order to make projections about the hardware and software re-
sources required to provision the data to Bio2RDF users. This will become increa-
singly important as we explore the provision of Bio2RDF data and related services in
a cloud computing environment.
        In summary, Bio2RDF Release 2 features updates to dataset conversion scripts
as well as new datasets, a framework for recording dataset provenance, and a set of
scripts to generate and publish Bio2RDF dataset metrics. We have demonstrated how
multiple open source tools can be used to visualize and explore Bio2RDF data (sec-
tions 3.4-3.6), as well as how dataset metrics may be used to inform querying. Future
work will involve the development of a „sandbox‟ for exploring and analyzing
Bio2RDF data as well as the addition of more datasets through registry-compliant
scripts.


5         References

1.         Razick S, Magklaras G, Donaldson IM: iRefIndex: a consolidated protein
           interaction database with provenance. BMC Bioinformatics 2008, 9:405.
2.         Goble C, Stevens R: State of the nation in data integration for bioinformatics. J
           Biomed Inform 2008, 41(5):687-693.
3.         Belleau F, Nolin MA, Tourigny N, Rigault P, Morissette J: Bio2RDF: towards a
           mashup to build bioinformatics knowledge systems. J Biomed Inform 2008,
           41(5):706-716.
4.         Ruttenberg A, Rees JA, Samwald M, Marshall MS: Life sciences on the Semantic
           Web: the Neurocommons and beyond. Brief Bioinform 2009, 10(2):193-204.
5.         Momtchev V., Peychev D., Primov T., Georgiev G.: Expanding the Pathway and
           Interaction Knowledge inLinked Life Data. In: Semantic Web Challenge: 2009;
           Amsterdam; 2009.


18
     https://github.com/dbcls/bh12/wiki/Yummy-data
6.   Chen B, Dong X, Jiao D, Wang H, Zhu Q, Ding Y, Wild DJ: Chem2Bio2RDF: a
     semantic framework for linking and data mining chemogenomic and systems
     chemical biology data. BMC Bioinformatics 2010, 11:255.
7.   Campinas S PT, Ceccarelli D, Delbru R, Tummarello G: Introducing RDF Graph
     Summary With Application to Assisted SPARQL Formulation. . In: 23rd
     International Workshop on Database and Expert Systems Applica-tions. Vienna
     Austria; 2012.
8.   Juty N, Le Novere N, Laibe C: Identifiers.org and MIRIAM Registry: community
     resources to provide persistent identification. Nucleic Acids Res 2012,
     40(Database issue):D580-586.

</pre>