DisGeNET: from MySQL to Nanopublication,
        Modelling Gene-Disease Associations
               for the Semantic Web

                 Núria Queralt-Rosinach and Laura I. Furlong

             Research Programme on Biomedical Informatics (GRIB),
              Hospital del Mar Medical Research Institute (IMIM),
                        Pompeu Fabra University (UPF),
                   C/ Dr. Aiguader 88, 08003 Barcelona, Spain


      Abstract. DisGeNET is a relational gene-disease database that has
      been converted to RDF in order to make gene-disease association data
      available for Semantic Web projects such as Open PHACTS. In this
      paper, we present the conversion of DisGeNET from MySQL to RDF and
      its modelling in the nanopublication data format, and we discuss
      the challenges encountered throughout the process.

      Keywords: Semantic Web, gene-disease association, relational data-
      base, RDF, ontology, nanopublication.


1    Introduction
The ideal data infrastructure for a pharmaceutical researcher is one that makes it
easy to carefully assemble, overlay and search across heterogeneous data sources
in order to extract knowledge that answers complex drug discovery questions.
RDF-based Semantic Web (SW) technology has gained a significant presence in the
Life Sciences as a means of connecting the various databases in this field. Open
PHACTS (Open Pharmacological Concept Triple Store) is a project funded by a
European grant from the Innovative Medicines Initiative (IMI;
http://www.imi.europa.eu) that aims to integrate distributed heterogeneous data
sources using a SW approach, developing an open source, open standards and open
access innovation platform, the Open Pharmacological Space (OPS). The project
intends to reach this goal
by using the Linked Data approach (http://linkeddata.org) and managing
the data in an RDF triple store. This semantically enriched and fully interop-
erable platform currently contains the relationships between compound-target-
pathway concepts and, consequently, it delivers information on small molecules
and their pharmacological profiles as well as on biological targets and pathways.
However, known gene-disease associations must be added in order to answer
important research questions that cannot be addressed with the existing OPS,
such as which compounds could effectively inhibit targets involved in a key
pathway for the development of a disease, which toxic interactions are possible,
or which drug repositioning opportunities exist in new therapeutic areas.

    In our lab, a relational database called DisGeNET [1, 2] was created to
hold the current knowledge of human genetic diseases, including Mendelian,
complex and environmental diseases. DisGeNET is a comprehensive gene-disease
database that integrates gene-disease associations stored in the UniProt, CTD™,
GAD and MGD databases with text-mining derived associations from the
literature-derived human gene-disease network (LHGDN) database. In addition,
gene-pathway information from Reactome and SNPs associated with gene-disease
relationships are provided in order to give a more complete picture of the
biological processes underlying a disorder and of the correlation of specific
genomic variants with disease predisposition. The integration is performed by
means of gene-disease vocabulary mapping and by using a new gene-disease
association ontology. Since the source databases use two different disease
vocabularies (MIM and MeSH terms), the vocabulary mapping is done by means of
the UMLS® Metathesaurus® concept structure. DisGeNET in RDF could therefore be
implemented in OPS, enabling the inclusion of disease-gene-pathway concepts in
the platform and the integration of its data with OPS compound/drug data. For
this reason, the DisGeNET MySQL database has been converted into the RDF data
model. Moreover, as the Open PHACTS project is co-developing and exploiting the
nanopublication format, which allows individual data items to be published,
cited and attributed in an RDF-based approach, we are adapting our RDF DisGeNET
data to the OPS nanopublication model according to the latest Open PHACTS
guidelines, so that our data can benefit from these features.
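
    The vocabulary mapping step described above can be illustrated with a
minimal Python sketch. It assumes that each source record carries either a MIM
or a MeSH disease identifier and that a lookup to a common UMLS concept
identifier is available. The lookup tables, the prefixes "MIM:" and "MESH:",
and every identifier shown are made-up placeholders rather than real database
entries, and the function names are hypothetical; the sketch is not the code
used to build DisGeNET.

```python
# Hypothetical lookup tables: every identifier below is a made-up
# placeholder, not a real MIM, MeSH or UMLS entry.
MIM_TO_CUI = {"MIM:000001": "C0000001"}
MESH_TO_CUI = {"MESH:D000001": "C0000001", "MESH:D000002": "C0000002"}


def to_cui(disease_id):
    """Return the common concept ID for a source disease ID, or None."""
    if disease_id.startswith("MIM:"):
        return MIM_TO_CUI.get(disease_id)
    if disease_id.startswith("MESH:"):
        return MESH_TO_CUI.get(disease_id)
    return None


def integrate(associations):
    """Merge (gene, disease_id, source) records on the (gene, CUI) pair."""
    merged = {}
    for gene, disease_id, source in associations:
        cui = to_cui(disease_id)
        if cui is None:
            continue  # unmapped terms are simply skipped in this sketch
        merged.setdefault((gene, cui), set()).add(source)
    return merged


records = [
    ("GENE1", "MIM:000001", "SOURCE_A"),
    ("GENE1", "MESH:D000001", "SOURCE_B"),
    ("GENE2", "MESH:D000002", "SOURCE_B"),
]
merged = integrate(records)
```

Here the first two records describe the same association in two vocabularies
and therefore collapse into a single entry that keeps both sources.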

    In this paper, we present DisGeNET as a new RDF gene-disease associa-
tion database, the methodology used for the MySQL-RDF conversion, the new
ontology developed to model the gene-disease association concept, and the
nanopublication data model. Finally, we discuss some of the challenges
encountered throughout the process.


2   Results

To convert the relational database to RDF, we first identified the concepts
and relations in our data. Once this was done, we created an RDF data model
schema that represents the knowledge stored in our database. Our RDF data model
captures the central role that gene-disease associations play in our database,
which covers the whole spectrum of human diseases with a genetic origin. In an
RDF data representation model, the information from different data sources is
semantically connected using existing, commonly shared ontologies. We therefore
explored the existing ontologies via services such as BioPortal in order to find
matching entries for each of our concepts and relations. RDF properties were
mapped onto a limited set of external ontologies and vocabularies that includes
the SemanticScience Integrated Ontology (SIO) for general science, the NCI
Thesaurus for biomedical terms, and Dublin Core to encode license information.
We also used common vocabularies such as rdf:, rdfs:, and owl:. Resources, i.e.
objects and subjects in RDF triples,
were identified by dereferenceable Internationalized Resource Identifiers (IRIs)
built upon DisGeNET IDs, which are IDs of other data collections. The providers
of these IRIs are the new Identifiers.org service (http://identifiers.org)
together with the MIRIAM Registry [4], and the Bio2RDF project [3]. Nevertheless,
as the disease concept in our database is identified by the Unified Medical
Language System® Concept Unique Identifier (UMLS® CUI), we decided to use
the Linked Life Data (http://linkedlifedata.com) provider instead of the
Human Disease Ontology (the latter also integrated in identifiers.org) because
it is directly based on the UMLS® CUI.
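
    As an illustration of how resource IRIs can be built from the identifiers
of other data collections, the following Python sketch fills a provider
pattern with a local ID. The text names the providers but does not give the
exact IRI patterns, so the two patterns below, the choice of an NCBI Gene
namespace for genes, and the function name build_iri are assumptions made for
this example only.

```python
from urllib.parse import quote

# Assumed IRI patterns for illustration; the providers are named in the
# text, but the exact patterns used for DisGeNET are not.
IRI_PATTERNS = {
    "gene": "http://identifiers.org/ncbigene/{id}",
    "disease": "http://linkedlifedata.com/resource/umls/id/{id}",
}


def build_iri(kind, local_id):
    """Build a resource IRI by inserting a percent-encoded ID in a pattern."""
    pattern = IRI_PATTERNS[kind]  # raises KeyError for an unknown kind
    return pattern.format(id=quote(str(local_id), safe=""))


gene_iri = build_iri("gene", 1234)               # placeholder gene ID
disease_iri = build_iri("disease", "C0000001")   # placeholder CUI
```

Keeping the patterns in a single table means that a change of provider only
requires editing one entry, which matters when a concept class has no settled
IRI scheme yet.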

    Common ontologies have been used whenever possible, but in the case of de-
scribing DisGeNET gene-disease association resources, new semantic terms had
to be created because no similar viable terms exist. Therefore, the RDF conver-
sion of DisGeNET is accompanied by a new gene-disease association ontology
developed in our lab for the correct semantic integration of gene-disease
association data from diverse data sources. To generate the RDF triples we used
the D2RQ platform (http://d2rq.org) and the RDF/Turtle language. The data were
validated with the Protégé platform (http://protege.stanford.edu).
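
    To give an idea of the kind of RDF/Turtle output involved, the Python
sketch below serializes a single gene-disease association by hand. It is not
the D2RQ mapping used in this work: the ex: namespace, the class
ex:GeneDiseaseAssociation, the property ex:refersTo and the function name are
hypothetical stand-ins for terms of the new ontology, and the IRIs in the
usage lines are placeholders.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://example.org/gda#",  # hypothetical ontology namespace
}


def association_to_turtle(assoc_id, gene_iri, disease_iri):
    """Serialize one association node and its two links as Turtle text."""
    lines = [f"@prefix {p}: <{ns}> ." for p, ns in PREFIXES.items()]
    lines.append("")
    # The association is a resource of its own, typed with the new class
    # and linked to both the gene and the disease it refers to.
    lines.append(f"ex:{assoc_id} rdf:type ex:GeneDiseaseAssociation ;")
    lines.append(f"    ex:refersTo <{gene_iri}> ,")
    lines.append(f"        <{disease_iri}> .")
    return "\n".join(lines)


turtle = association_to_turtle(
    "gda1",
    "http://example.org/gene/1",     # placeholder gene IRI
    "http://example.org/disease/1",  # placeholder disease IRI
)
```

Modelling the association as a node, rather than as a direct gene-to-disease
triple, leaves room to attach source and evidence information to it later.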


3   Discussion

It is well known that the namespace of biomedicine is messy and ambiguous and,
unlike that of other disciplines, lacks universal standards. This problem is
not exclusive to the identification of resources; many synonyms exist on the
Web for key concept classes such as genes, proteins, genetic variations and
diseases. For this reason, the most difficult part of the RDF conversion of
DisGeNET was to find not only proper IRIs for properties and resources but also
adequate namespaces for the semantic types of concept classes. We searched
exhaustively for ontologies, trying to choose those that best fit the meaning
of our concepts and properties and that are commonly used both by the
scientific community and in the Open PHACTS project. An important problem not
yet solved is the use of valid IRIs to describe the RDF nodes for the
gene-disease association concept. This is a major task because IRIs are
required to be dereferenceable, i.e. it must be possible to retrieve
information about the referenced resource on the Web. Possible solutions
include registering each instance of the concept in the MIRIAM Registry.
Another issue not yet addressed is the use of a valid IRI pattern to identify
disease MeSH hierarchy classes, as MeSH does not have an IRI pattern available.
The nanopublication format raises another example of this IRI problem, as each
named graph of a nanopublication and the entire nanopublication unit itself
need an IRI pattern schema. A further problem is the proper tracking of the
modifications that a nanopublication may undergo over time due to updating and
curation processes. We are currently tackling all these issues.
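
    One conceivable IRI pattern schema for nanopublications is sketched below
in Python. It is only an illustration of the problem, not the scheme adopted
for DisGeNET: the base IRI, the ".v<version>" suffix used to distinguish
revisions, the fragment names and both function names are assumptions. The
np: terms are taken from the nanopublication schema namespace
(http://www.nanopub.org/nschema#).

```python
BASE = "http://example.org/nanopub/"      # hypothetical base IRI
NP = "http://www.nanopub.org/nschema#"    # nanopublication schema namespace
GRAPH_PARTS = ("head", "assertion", "provenance", "publicationInfo")


def nanopub_iris(assoc_id, version):
    """Derive one IRI for the nanopublication and one per named graph.

    The version is part of the IRI, so every revision of the same
    association receives a fresh set of identifiers.
    """
    root = f"{BASE}{assoc_id}.v{version}"
    iris = {"nanopublication": root}
    for part in GRAPH_PARTS:
        iris[part] = f"{root}#{part}"
    return iris


def head_graph_trig(iris):
    """Render the head graph, which links the three content graphs."""
    lines = [
        f"<{iris['head']}> {{",
        f"    <{iris['nanopublication']}> a <{NP}Nanopublication> ;",
        f"        <{NP}hasAssertion> <{iris['assertion']}> ;",
        f"        <{NP}hasProvenance> <{iris['provenance']}> ;",
        f"        <{NP}hasPublicationInfo> <{iris['publicationInfo']}> .",
        "}",
    ]
    return "\n".join(lines)


iris = nanopub_iris("gda1", 2)
trig = head_graph_trig(iris)
```

Under this assumed scheme, a curated update of an association would be issued
as version 3 with new IRIs, while version 2 stays citable; whether such IRIs
can be made dereferenceable remains the open question discussed above.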

   Regarding licensing, DisGeNET is distributed under the GNU GPL 3.0 license,
which means that we have made the data available as open data. License
incompatibilities are omnipresent in open source data. This raises the question
of whether IRIs from databases without open data licenses can be used in our
RDF database. Is the use of IRIs subject to licensing?

4   Conclusions
To sum up, we have carried out the RDF conversion of DisGeNET, a derivative
relational database that integrates data connected with the gene-disease
association concept from several public data sources (curated databases and
literature), in order to integrate it into the OPS platform of the Open PHACTS
project. Importantly, DisGeNET could provide significant data to answer complex
and relevant pharmacological questions thanks to the introduction of the
disease concept into OPS and its relationship with genotype. Specifically, the
RDF version of DisGeNET is a set of triples that include information around
gene-disease associations, such as SNPs related in the literature to
predisposition to the disease, and the pathways in which genes are known to be
involved. In the main, this conversion has been done according to the Linked
Data principles, open access and interoperability of the data. Currently, we
are modelling the RDF triples in the nanopublication format, because this model
could allow the data both to adapt better to the features of the OPS platform
and to benefit from the advantages of nanopublications, such as citability. In
the future, the implementation of a SPARQL endpoint to provide open access to
the information and to query the RDF DisGeNET data will be evaluated.

Acknowledgments. The research leading to these results has received support
from the IMI Joint Undertaking under grant agreement n° 115191, Open PHACTS,
resources of which are composed of financial contribution from the EU FP7
(FP7/2007-2013) and EFPIA companies' in-kind contribution; and the
Instituto de Salud Carlos III FEDER (CP10/005249). The Research Programme
on Biomedical Informatics (GRIB) is a node of the Spanish National Institute
of Bioinformatics (INB).

References
1. Bauer-Mehren, A., Rautschka, M., Sanz, F., Furlong, L.I.: DisGeNET: a Cytoscape
   Plugin to Visualize, Integrate, Search and Analyze Gene-Disease Networks.
   Bioinformatics. 26, 2924–2926 (2010)
2. Bauer-Mehren, A., Bundschus, M., Rautschka, M., Mayer, M.A., Sanz, F., Furlong,
   L.I.: Gene-Disease Network Analysis Reveals Functional Modules in Mendelian,
   Complex and Environmental Diseases. PLOS One. 6, e20284 (2011)
3. Belleau, F.: Bio2RDF: towards a Mashup to Build Bioinformatics Knowledge Sys-
   tems. J. Biomed. Inform. 41, 706–716 (2008)
4. Juty, N., Le Novère, N.: Identifiers.org and MIRIAM Registry: Community Resources
   to Provide Persistent Identification. Nucleic Acids Res. 40, D580–D586 (2012)