<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LinkedPPI: Enabling Intuitive, Integrative Protein-Protein Interaction Discovery</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laleh Kazemzadeh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maulik R. Kamdar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oya D. Beyan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Decker</string-name>
          <email>stefan.decker@deri.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Barry</string-name>
          <email>frank.barry@nuigalway.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Insight Center for Data Analytics, National University of Ireland</institution>
          ,
          <addr-line>Galway</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Regenerative Medicine Institute, National University of Ireland</institution>
          ,
          <addr-line>Galway</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Understanding the dynamics of protein-protein interactions (PPIs) is a cardinal step for studying human diseases at the molecular level. Advances in sequencing technologies have resulted in a deluge of biological data related to gene and protein expression, yet our knowledge of PPI networks is far from complete. The lack of an integrated vocabulary makes querying this data difficult for domain users, whereas its large volume makes intuitive exploration difficult. In this paper we employ Linked Data technologies to develop a framework, LinkedPPI, to support domain researchers in integrative PPI discovery. We demonstrate the semantic integration of various data sources pertaining to biological interactions, expression and functions using a domain-specific model. We deploy a platform which enables search and aggregative visualization in real-time. We finally showcase three user scenarios to depict how our approach can help identify potential interactions between proteins, domains and genomic segments.</p>
      </abstract>
      <kwd-group>
        <kwd>Protein-Protein Interaction Network</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Domain-specific Model</kwd>
        <kwd>Visualisation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Introduction</title>
      <sec id="sec-2-1">
        <title>Background</title>
        <p>The study of biological networks forms the integral core of biomedical research
related to human diseases and drug development. The ultimate goal of such
studies is to understand the connections between different genes and proteins, and how
cell signals propagate across these networks and regulate their functionality.
Hence understanding the Protein-Protein Interaction (PPI) network underlying
each such cellular mechanism is important in order to specifically target the dysfunctional
proteins, leading towards the discovery of potential drugs and treatments for
diseases. Studying PPI networks helps us understand the interconnectedness between
different cellular mechanisms and pathways. Biological pathways are not
independent of each other; their interactions are harmonious, which makes them
part of a bigger network. Thus it is important to investigate the dynamics of the
cell system as a whole.</p>
        <p>The human genome contains more than 20,000 protein-coding genes, which
interact tightly in order to regulate various cellular pathways and mechanisms.
The major challenge in developing a thorough understanding of these cellular
mechanisms and pathways is to complete the PPI network for each mechanism.
However, experimental validation of the binary interactions between all possible
pairs of proteins is an inconceivable task, so computational models can be used
to aid researchers. These models help identify the sequential, structural and
physicochemical properties of known interacting protein pairs and highlight the
underlying patterns. Researchers then apply these patterns to narrow down the
potential interacting partners for any protein(s) under investigation. As a result,
wet-lab validation of the hypotheses formed around the predicted links and
protein partners becomes realistic and achievable.</p>
        <p>
          In computational models, experimentally validated PPIs form the backbone
of PPI networks, however data pertaining to gene-expression, domain-domain
interactions and genomic locations have proved their valuable contribution in
inference and prediction of new links between protein pairs [
          <xref ref-type="bibr" rid="ref19 ref6">6,19</xref>
          ]. Each of these
data sources has been published to address specific, albeit very different, research
problems. Therefore the data representation, data model and formats may vary
from one data source to the other. Challenges stemming from the heterogeneity of
the data emphasise the need for a framework which can bridge these biologically
different concepts in order to highlight and extract the ubiquitous patterns that are
inconspicuous in the bigger picture.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Motivation</title>
        <p>
          Due to advances in sequencing technologies, an enormous amount of
experimental data has been generated and stored in independent databases. Databases
such as BioGRID [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], HPRD [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], MINT [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] contain the experimentally
validated binary interactions, while UniProt [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], Ensembl [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], Entrez-Gene [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and
Gene Ontology [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] offer sequence information, genome localisation and cellular
functionality of individual genes and proteins. On the other hand, knowledge
bases like Pfam [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] contain information regarding the functional and structural
protein subunits (domains).
        </p>
        <p>
          The main motivation of this work is to provide researchers with a framework
which enables them to retrieve the answers to their research questions from these
disparate data sources. A researcher interested in the list of protein domains in
a specific protein can look up the UniProt website<sup>3</sup>, which is extensively rich
in protein information. Genomic locations of protein-coding genes are publicly
available from several websites such as CellBase [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. However, questions like "Which proteins contain the exact or a partial
set of these protein domains?" or "What is the relation between a set of interacting
proteins and the genomic locations of their underlying genes?" cannot be answered
through these websites.
        </p>
        <p>The challenges in the aggregation and exploration of the aforementioned
massive biological data sources have sparked the interest of several domain
researchers and led them towards the adoption of a new generation of integrative</p>
        <sec id="sec-2-2-1">
          <title>3 http://www.uniprot.org</title>
          <p>
            technologies, based on Semantic Web Technologies and Linked Data concepts,
thus giving birth to Integrative Bioinformatics [
            <xref ref-type="bibr" rid="ref25 ref7">25,7</xref>
            ].
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>The final goal of this research is the identification and extraction of potential PPI
networks from various publicly available data sources. The core structure of the
PPI network consists of proteins and their experimentally proven interactions.
Fig. 1 depicts an overview of the LinkedPPI architecture. The following subsections
describe the data selection, RDFization and integration methodologies used.</p>
      <p>
        Validated Interactions: Experimentally validated interactions were retrieved
from BioGRID (Biological General Repository for Interaction Datasets), one
of the most comprehensive PPI databases [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. Only physical interactions were
included in our work, regardless of their classification as raw or non-redundant.
      </p>
      <p>
        Protein Complexes: In most cellular processes proteins act as a complex,
rather than through binary interactions between a pair of single proteins [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], at the
same time and within the same cellular compartment. Such proteins interact
tightly and play key roles in PPI networks. Elucidating the dynamics of
PPI networks and the functionality of individual proteins can benefit from the
identification of essential protein complexes, since different subunits contribute to driving
a cellular function. In this work, we have used the latest release of CORUM
(Comprehensive Resource of Mammalian protein complexes) [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
      <p>
        Gene Expression: Understanding the correlation between gene-expression
networks and protein interaction networks is an ongoing challenge in PPI studies.
Proteins coded by co-expressed genes are more likely to interact with each
other [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and there is a higher probability that an interacting pair of proteins
shares cellular functions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We have used the COXPRESdb database [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], which
publishes recent gene expression microarray datasets for human.
      </p>
      <p>
        Genomic Locations: Neighboring genes show similar expression patterns and
are often involved in similar biological functions [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], which suggests that they
might share the same activation and translation mechanisms. These interactions
may not be limited to adjacent genes but can be long-range interactions that
fulfil the cellular functionality [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Such evidence encouraged us to introduce
a layer for the genomic locations of the protein-coding genes in our framework.
We do not define "genomic location" as the exact start/stop position of a gene on
a chromosome, but as the ideogram band in which the gene resides. Ideograms
are schematic representations which depict fixed staining patterns on a tightly
coiled chromosome in karyotype experiments. A karyotype describes the number of
chromosomes, their shape and length, and the banding patterns of chromosomes in
the nucleus. Ideogram data was downloaded from the Mapping and
Sequencing Tracks in the Human Genome Assembly (GRCh37/hg19, Feb 2009) at the
UCSC Genome Browser<sup>4</sup>. The start/stop coordinates of the genes were retrieved
from CellBase [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and used to determine the genes within each ideogram. HGNC
(HUGO Gene Nomenclature Committee) was used to map common genes
referenced by different identifiers (Entrez-Gene, Ensembl and UniProt) [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ].
      </p>
      <p>
        Protein Domains: The functionality and structure of proteins are defined by
their domain specification. Each protein consists of single or multiple domains,
mutual sharing of which may lead to interaction with other proteins. However,
identification of domain interactions through experimental validation for all
possible protein pairs is an insurmountable task. Therefore domain knowledge bases
can shed light on PPIs as well as help identify novel domain-domain interactions.
We used 3did (Database of three-dimensional interacting domains) [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], which
contains domain-domain interactions identified in high-resolution three-dimensional structures.
      </p>
      <p>
        Gene Co-occurrence: Studying co-occurrence networks of genes can lead to
the prediction of novel PPIs and the discovery of hidden biological relations [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
Previously Kamdar et al. generated co-occurrence scores as a weighted
combination of the total number of diseases, pathways or publications in which any
two genes occur simultaneously [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>2.2 LinkedPPI Data Integration</title>
        <p>One of the crucial challenges in integrative bioinformatics is the heterogeneous
nature of biological data sources. Even though several attempts have been made</p>
        <sec id="sec-3-1-1">
          <title>4 http://genome.ucsc.edu/</title>
          <p>
            in the standardization of the data through controlled vocabularies and guidelines,
various hurdles still need to be overcome. The Proteomics Standards
Initiative-Molecular Interaction (PSI-MI) format is widely accepted by the community for the
modelling of biological networks [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. Even though some of the data sources are
represented using the PSI-MI format, there is no decipherable interconnectedness
between them. To introduce the desired interconnectedness, or "bridges", between
these data sources, we decided to use Linked Data technologies.
          </p>
          <p>[Fig. 2: The LinkedPPI domain model. Concepts shown: Experiment, Protein, Gene, Protein Domain, DD Interaction, Interaction, Component, Complex, Ideogram, Publication, Organism and II Interaction, connected by relations including detects, hasDomain, hasGene, partOf, hasComponent, publishedIn and detectedIn.]</p>
          <p>
            We propose a simple, concise domain model for modelling the PPIs
retrieved from BioGRID, complexes from CORUM, protein domains and
genomic locations. A domain-specific model is beneficial over an extensive,
well-construed ontology because it omits non-domain-specific concepts (Thing,
Continuant, etc.) and is much smaller and self-contained, addressing a specific
problem. Being native to a particular domain (e.g. protein-protein interactions),
it serves as an intermediate layer between the user and the underlying data, and
enables intuitive knowledge exploration and discovery [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. Our model comprises
12 concepts which are deemed relevant in this domain, and is shown in Fig. 2.
The core concept in this model is a Component. A Component can be either a
Gene or a Protein. A Component can be part of a Complex or can interact with
another Component through an Interaction. The Organism, in which both the
Interaction and the Complex are detected, is also available as a distinct concept.
The Experiment concept embodies the attributes related to the experimental
system which was used to detect the interaction (e.g. Y2H, AP-MS), the scale of
the experiment (high- or low-throughput), and whether the interaction is physical
or genetic. Publication documents the experiments, and links to resources
described in the PubMed repository. A Gene is contained within an Ideogram,
and IIinteraction represents interactions between two ideograms inferred from
experimentally validated PPIs. Domains associated with protein-coding genes
are represented using Pfam IDs, and DDinteraction models interaction scores
between two domains, retrieved from 3did and inferred from BioGRID.
          </p>
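          <p>As an illustration only, part of the domain model described above can be sketched as plain Python data classes. The class and field names below (Component, Interaction, Complex, partners) mirror the concept names in the text, but the exact attributes and the helper function are assumptions made for this sketch; they are not the RDF vocabulary used by LinkedPPI.</p>
          <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Component:
    """Core concept of the model: either a Gene or a Protein."""
    identifier: str   # e.g. an Entrez-Gene or UniProt identifier
    kind: str         # "Gene" or "Protein"
    symbol: str = ""  # official symbol, if known


@dataclass
class Interaction:
    """A link between two Components (field names are hypothetical)."""
    left: Component
    right: Component
    source_database: str
    experiment_system: Optional[str] = None  # e.g. "Y2H" or "AP-MS"
    pubmed_id: Optional[str] = None


@dataclass
class Complex:
    """A group of Components that act together."""
    identifier: str
    name: str
    components: List[Component] = field(default_factory=list)


def partners(component, interactions):
    """Return the Components that interact with `component`, in input order."""
    found = []
    for interaction in interactions:
        if interaction.left == component:
            found.append(interaction.right)
        elif interaction.right == component:
            found.append(interaction.left)
    return found
```
]]></preformat>
          <p>Given a list of Interaction objects, calling partners with one Component returns its direct neighbours, which corresponds to following the left and right links of the Interaction concept in either direction.</p>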
          <p>
            OpenRefine provides a workbench to clean and transform data and
eventually export it in the required format. We used its RDF Extension [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ] to model
and convert the tab-delimited files downloaded from CORUM, BioGRID, 3did
and CellBase to RDF graphs, and stored them in a local Virtuoso Triple Store<sup>5</sup>.
Data from COXPRESdb was already published on the web as RDF, and we
re-used its data model and URIs. Similarly, for the other data sources we
re-used the URIs for genes, proteins and publications provided by
Entrez-Gene, UniProt and PubMed. To determine which gene is responsible for
encoding which protein (the mapping between Entrez-Gene ID and UniProt
ID), we used the ID mapping table<sup>6</sup> provided by UniProt. One of the major
advantages of this approach was that the mapping also linked the relevant
Gene Ontology (GO) [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] terms to the Entrez-Gene ID, thus providing additional
information regarding the localisation and function of the specific genes.
          </p>
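          <p>The conversion of tab-delimited rows into RDF can be illustrated with a minimal sketch that emits N-Triples lines using only the Python standard library. This is not the OpenRefine RDF Extension workflow described above: the namespace http://example.org/linkedppi/, the predicate names and the column names (interaction_id, gene_a, gene_b, source) are all hypothetical, and literal values are assumed to need no escaping.</p>
          <preformat><![CDATA[
```python
import csv
import io

# Hypothetical namespace, used only for this illustration.
BASE = "http://example.org/linkedppi/"


def row_to_triples(row):
    """Turn one tab-delimited interaction row into N-Triples lines."""
    subject = f"<{BASE}interaction/{row['interaction_id']}>"
    return [
        f"{subject} <{BASE}left> <{BASE}component/{row['gene_a']}> .",
        f"{subject} <{BASE}right> <{BASE}component/{row['gene_b']}> .",
        # Assumes the source name contains no quotes or backslashes.
        f'{subject} <{BASE}sourceDatabase> "{row["source"]}" .',
    ]


def convert(tsv_text):
    """Convert tab-delimited text with a header row into a list of triples."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    triples = []
    for row in reader:
        triples.extend(row_to_triples(row))
    return triples
```
]]></preformat>
          <p>Each input row yields three triples that share one subject URI. Re-using existing URIs for genes and proteins, as done in this work, would amount to replacing the hypothetical component URIs with those published by the original providers.</p>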
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>After RDFization, the BioGRID data source contains around 11 million triples
(11357231), which establish 634996 distinct interactions between
14135 human proteins. The data source also links out to 38952 unique PubMed
publications documenting these PPIs. The CORUM data source consists of
156364 triples, with 2867 distinct complexes. The 3did data source consists of
320690 triples, with 6818 distinct protein domains and 61582 validated and
inferred domain-domain interactions. We inferred a total of 13493 interactions
between 405 ideograms, referenced through 80092 triples. In total, 60676 mappings were
instantiated between the genes and 7750 extracted GO child leaves.</p>
      <sec id="sec-4-1">
        <title>3.1 Search and Visualization</title>
        <p>
          As such, relevant information could be retrieved from the SPARQL endpoint
through the formulation of appropriate queries. However, as SPARQL has
a steep learning curve, the non-technical domain user needs intuitive,
interactive visualization tools which aggregate this information from the multiple data
sources and summarize it. We devised a PPI Visualization Dashboard<sup>7</sup> based
on ReVeaLD (Real-time Visual Explorer and Aggregator of Linked Data) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]
to accommodate our requirements for the search and visual exploration of the
LinkedPPI networks. As the user starts typing the official symbol of the
desired protein, a list of possible alternatives is retrieved from the indexed
entities. On selection, the entity URI is passed as a parameter through a set of
        </p>
        <sec id="sec-4-1-1">
          <title>5 http://srvgal78.deri.ie/sparql</title>
          <p>6 ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/README</p>
          <p>7 http://srvgal78.deri.ie/linkedppi</p>
          <p>pre-formulated SPARQL SELECT queries<sup>8</sup> targeting the various data sources.
As shown in Fig. 3, the PPI network associated with the searched protein (e.g.
HES1, entrezgene:3280) is rendered in a force-directed layout. The entities
retrieved from the data sources are represented as circular nodes, with the size
of each node directly proportional to the number of associated nodes. The nodes
are rendered using different colors for the sake of visual differentiation: Red for
Components (Proteins of BioGRID or Genes of COXPRESdb), Blue for
CORUM Complexes and Light Brown for 3did Protein Domains. The three categories
of GO Child Nodes (Biological Processes, Molecular Functions and Cellular
Components) are displayed using Green, Yellow and Purple respectively.</p>
          <p>The interactions between different proteins are represented as edges. The
color of an edge depends on whether the association has been
retrieved from BioGRID, COXPRESdb or Co-occurrence Data (Black, Red and
Purple respectively). The thickness of Black edges depends on the total number
of publications which have experimentally validated the underlying interactions.
The thickness of the Red and Purple edges depends on the PCC (Pearson
Correlation Coefficient for Gene Expression) and Co-occurrence scores respectively.
Protein nodes which are present in the same complex, possess interacting
domains or have underlying coding genes associated with the same GO terms are
not connected directly to each other by edges. Instead, they are all connected by
similarly colored edges to the respective node (complex, domain or GO term), although
there may also be experimental interactions or co-expression between
the connected entities. The resulting network is hence densely clustered, rather
than a simplistic radial layout of nodes. Hovering over any node highlights the
subgraph of the network which contains only the first-level connected nodes and
their relations (Fig. 4), allowing any domain user to intuitively deduce
answers to simple questions like "Which protein-encoding genes in the network
share the same molecular function and have experimental co-expression?" An
information box is also displayed beside the hovered node to show additional
information like GO term descriptions, Pfam or PubMed IDs, and PCC scores.
Zooming and panning across the visualization is possible using a mouse.</p>
          <p>8 https://gist.github.com/maulikkamdar/a47fbecddecc6ba4b373</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>3.2 Use Cases</title>
        <p>The following subsections describe three different scenarios, depicted in Fig. 5,
in which our framework could be employed to extract implicit
information that can be used as a predictor of novel protein-protein interactions.
The relevant SPARQL queries are documented at http://goo.gl/xesMjR.</p>
        <p>Use Case 1: Extraction of Potential Protein-Protein Interactions Based
on the Domain-Domain Interactions. It is well known that proteins carry out their functions
through their protein domain(s). In this scenario we aim
to extract possible PPIs based on the known domain-domain interactions. For
simplicity, we assume in this use case that we are interested in proteins
which contain single domains. A researcher has a protein in mind for which
the sequence specification and domain composition are known. An interesting
question might be: what is the list of potential protein partners for this protein? Using our
framework, the researcher can retrieve the list of protein pairs in which at least one
of the proteins contains the same protein domain as the protein in
question. Possible outcomes are: a) The protein under investigation itself shows up
in the result set, which then forms the list of its experimentally validated protein
partners. However, this could be queried from the BioGRID web site directly. b)
A list of proteins with one single domain. In this case, with a naive and
straightforward conclusion, the researcher may accept the list. However, in most cases
further application of GO enrichment or advanced statistical analysis offers a
more concise list, but these analyses are beyond the scope of this work. c) The
query results in a set of proteins consisting of several domains, which requires
further refinement through statistical analysis or domain expert knowledge. Despite the need for
further investigation in such cases, the shortlist of hypothetical interaction
partners is expected to be brief enough to save a tremendous amount of time and effort.
The SPARQL query uses the example of the HES1 (entrezgene:3280) protein.
We obtain the list of domains present in HES1, Hairy Orange (pfam:PF07527)
and HLH (pfam:PF00010), and the list of proteins (e.g. HEY2) which share these
domains or have domain-domain interactions. We then retrieve validated PPIs
in which the protein participates (e.g. HEY2-SIRT1). We can also obtain the
PubMed publication documenting each PPI.</p>
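        <p>The selection logic of Use Case 1 can be sketched in a few lines of Python over in-memory data. This is a simplified illustration, not the SPARQL query referenced above: the function name candidate_partners and its inputs (a mapping from protein to its set of domains, and a list of interacting domain pairs) are assumptions of this sketch.</p>
        <preformat><![CDATA[
```python
def candidate_partners(query, protein_domains, domain_interactions):
    """Return proteins that share a domain with `query` or carry a domain
    known to interact with one of its domains, sorted by name.

    protein_domains: dict mapping a protein name to a set of domain IDs.
    domain_interactions: iterable of (domain_a, domain_b) pairs.
    """
    query_domains = protein_domains[query]

    # Domains linked to any of the query's domains by a known interaction.
    interacting = set()
    for domain_a, domain_b in domain_interactions:
        if domain_a in query_domains:
            interacting.add(domain_b)
        if domain_b in query_domains:
            interacting.add(domain_a)

    relevant = query_domains | interacting
    return sorted(
        protein
        for protein, domains in protein_domains.items()
        if protein != query and domains & relevant
    )
```
]]></preformat>
        <p>The returned shortlist corresponds to outcomes b) and c) above; it would still need the statistical or expert refinement discussed in the text before any candidate is treated as a likely partner.</p>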
      </sec>
      <sec id="sec-4-3">
        <title>Use Case 2: Identification of Potential Domain-Domain Interactions.</title>
        <p>
          Protein-protein interactions can be identified experimentally through various
types of experiments (e.g. Yeast Two-Hybrid). However, it is not possible to
identify the interacting domains between two proteins from the same experiments;
this requires a different set of experiments and protocols. Often protein
domains act as signature elements and repeatedly interact with each other within
the same organism. These frequent observations therefore assist in the identification
of novel domain-domain interactions, which in turn helps in the identification of
latent PPIs. In this work, however, such observations are inferred implicitly
from the validated PPI dataset (BioGRID) and require further statistical
significance analysis. In our SPARQL query example, we retrieve the validated and the
inferred scores for domain interactions with the HLH domain (pfam:PF00010).
        </p>
        <p>
          Use Case 3: Identification of Selective Interactions between Segments
of Human Genome. Human chromosomes are compact in 3D space, with each
chromosome folding into its own territory [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Even though the exact relation between the
spatial conformation of genes and their functionality is not yet fully understood,
studies have shown that the structure of the human genome follows its
functionality [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. It is widely believed that chromosomal folding brings functional
elements into close proximity regardless of their inter- or intra-chromosomal
distance in base pairs. In other words, the notions of close and far are
represented differently in the spatial map of the genome. Also, it has been shown
that contacts between small and gene-rich chromosomes are more frequent
[
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. This evidence suggests linkages between chromosomal conformation, gene
activity and the functionality of the gene products (proteins). Identifying the significance
of the association between the genomic location of genes and the partner selection of their products
will help complete the proximity pattern followed by genes and shaped by
their arrangement. The prospective pattern can be employed in prediction
models in order to infer potential protein interactions.
        </p>
        <p>In this work we have selected the boundaries of ideogram bands on each
chromosome as the genomic location of each gene. Several genes may reside on one
ideogram, and some genes may fall between two consecutive ideograms. The
reason for selecting this distance unit is to take into account the effect of
the co-expression of neighboring genes and the possible shared mechanism.
Protein pairs from the PPI dataset are mapped to their genomic locations, and a simple
frequency calculation on the retrieved data can identify the significance of
interactions between two genomic segments. Based on these findings researchers can
propose the genomic location pattern in which proteins preferentially select their
interacting partners. These regions may contain genes that are involved in the same
pathways or share the same functionality, which are yet to be identified by further
gene enrichment analysis. As shown in the provided SPARQL query, we can
retrieve the pre-determined inferred score between two ideograms, e.g.
Chromosome 3 - q29 and Chromosome 10 - q24.32, containing the protein-coding genes
for the HES1 and SIRT1 (entrezgene:23411) proteins respectively.</p>
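        <p>The mapping of genes to ideogram bands and the simple frequency calculation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: bands are given as (chromosome, start, stop, name) tuples, a gene is assigned to every band it overlaps, and the result is a raw count per band pair rather than the inferred score stored in LinkedPPI. The function names band_of and band_pair_counts are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def band_of(chrom, start, stop, bands):
    """Return the names of all bands on `chrom` that overlap [start, stop].

    bands: iterable of (chromosome, band_start, band_stop, band_name) tuples.
    A gene spanning a band boundary is reported for both bands.
    """
    return [
        name
        for (band_chrom, band_start, band_stop, name) in bands
        if band_chrom == chrom and start <= band_stop and stop >= band_start
    ]


def band_pair_counts(interactions, gene_bands):
    """Count interactions per unordered pair of ideogram bands.

    interactions: iterable of (gene_a, gene_b) pairs.
    gene_bands: dict mapping a gene to the list of bands it lies in.
    Genes without a known band are skipped.
    """
    counts = Counter()
    for gene_a, gene_b in interactions:
        for band_a in gene_bands.get(gene_a, []):
            for band_b in gene_bands.get(gene_b, []):
                # Sort so that (x, y) and (y, x) fall under the same key.
                counts[tuple(sorted((band_a, band_b)))] += 1
    return counts
```
]]></preformat>
        <p>Band pairs with high counts are candidates for the preferential genomic location patterns discussed above, subject to the statistical significance analysis that this sketch does not perform.</p>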
        <p>[Fig. 5: Three use-case scenarios. (a) A query protein PX and related protein pairs; (b) a query domain Dx with validated and inferred domain-domain interactions; (c) proteins PA and PB with genes ga and gb.]</p>
        <p>
Jiang et al. developed a Semantic Web-based framework which predicts targets of
drug adverse effects based on PPIs and a gene functional classification algorithm
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Chem2Bio2RDF [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] integrates data sources from Bio2RDF<sup>9</sup> in order to
study polypharmacology and multiple pathway inhibitors, which also requires a
thorough understanding of the underlying PPI network.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>The incorporation of complementary datasets for the expansion of PPI networks
is a useful approach to gain insight into biological processes and to discover novel
PPIs which have not been documented in the current PPI databases. However,
there is an inherent high level of heterogeneity at the schema and instance level of
these data sources, due to lack of a common representation schema and format.
Hence, we decided to apply Linked Data concepts in the integration, retrieval
and visualisation of concealed information. The enormous amount of publicly
available data and its dynamicity, in terms of regular updates, is currently a</p>
      <sec id="sec-5-1">
        <title>9 http://bio2rdf.org</title>
        <p>rate-limiting step to our data-warehousing approach for centralised analysis. We
have proposed a domain-specific model which can accommodate the needs of the
field of PPI modelling. The use of a domain-specific model and an interactive
graph-based exploration platform for search and aggregative visualisation makes
our integration approach more intuitive for the actual users who deal with PPI
predictions. We have also proposed a set of three user scenarios depicting how the
LinkedPPI framework could be used for the prediction of potential interactions
between proteins, domains and genomic regions.
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Future Work</title>
      <p>The approach presented in this work is used to extract
valuable information with regard to PPI networks, domain-domain interactions
and selective genomic interactions. However, the observations obtained from
such data retrieval are raw; they could become a valuable asset for simulations
and prediction methods if further analysis is done. As part of future work we
intend to apply statistical analysis to the significance of such observations in order
to develop a classifier algorithm which is able to predict interacting
and non-interacting protein pairs.</p>
      <p>Acknowledgements This work has been done under the Simulation Science
program at the National University of Ireland, Galway. SimSci is funded by
the Higher Education Authority under the program for Research in Third-level
Institutions and co-funded under the European Regional Development fund.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alberts</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>The Cell as a Collection of Protein Machines: Preparing the Next Generation of Molecular Biologists</article-title>
          .
          <source>Cell</source>
          <volume>92</volume>
          (
          <issue>3</issue>
          ),
          <fpage>291</fpage>
          –
          <lpage>294</lpage>
          (Feb
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ashburner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ball</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          , et al.:
          <article-title>Gene ontology: tool for the unification of biology. The Gene Ontology Consortium</article-title>
          .
          <source>Nature genetics</source>
          <volume>25</volume>
          (
          <issue>1</issue>
          ),
          <fpage>25</fpage>
          –
          <lpage>29</lpage>
          (May
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bhardwaj</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Correlation between gene expression profiles and protein-protein interactions within and across genomes</article-title>
          .
          <source>Bioinformatics</source>
          <volume>21</volume>
          (
          <issue>11</issue>
          ),
          <fpage>2730</fpage>
          –
          <lpage>2738</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Bin</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiao</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.J.H.W.Q.Z.Y.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wild</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          :
          <article-title>Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data</article-title>
          .
          <source>BMC Bioinformatics</source>
          <volume>11</volume>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bleda</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarraga</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Maria</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salavert</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , et al.:
          <article-title>CellBase, a comprehensive collection of RESTful web services for retrieving relevant biological information from heterogeneous sources</article-title>
          .
          <source>Nucleic acids research</source>
          <volume>40</volume>
          (
          <issue>W1</issue>
          ),
          <fpage>W609</fpage>
          –
          <lpage>W614</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chatterjee</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basu</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>K.M.N.M.P.D.</surname>
          </string-name>
          :
          <article-title>PPISVM: prediction of protein-protein interactions using machine learning, domain-domain affinities and frequency tables</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          :
          <article-title>Semantic Web meets Integrative Biology: a survey</article-title>
          .
          <source>Briefings in Bioinformatics</source>
          <volume>14</volume>
          (
          <issue>1</issue>
          ),
          <fpage>109</fpage>
          –
          <lpage>125</lpage>
          (Jan
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Flicek</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Ensembl 2012</article-title>
          .
          <source>Nucleic acids research</source>
          p. gkr991
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Goel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harsha</surname>
            ,
            <given-names>H.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pandey</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prasad</surname>
            ,
            <given-names>K.S.</given-names>
          </string-name>
          :
          <article-title>Human Protein Reference Database and Human Proteinpedia as resources for phosphoproteome analysis</article-title>
          .
          <source>Molecular bioSystems</source>
          <volume>8</volume>
          (
          <issue>2</issue>
          ),
          <fpage>453</fpage>
          –
          <lpage>463</lpage>
          (Feb
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Grigoriev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>On the number of protein-protein interactions in the yeast proteome</article-title>
          .
          <source>Nucleic Acids Res</source>
          <volume>31</volume>
          ,
          <fpage>4157</fpage>
          –
          <lpage>4161</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Guoqian</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chen</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chute</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          :
          <article-title>A Framework of Knowledge Integration and Discovery for Supporting Pharmacogenomics Target Predication of Adverse Drug Events: A Case Study of Drug-Induced Long QT Syndrome</article-title>
          .
          <source>AMIA Summits Transl Sci Proc</source>
          pp.
          <fpage>88</fpage>
          –
          <lpage>92</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hermjakob</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montecchi-Palazzi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bader</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wojcik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salwinski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al.:
          <article-title>The HUPO PSI's molecular interaction format - a community standard for the representation of protein interaction data</article-title>
          .
          <source>Nature biotechnology</source>
          <volume>22</volume>
          (
          <issue>2</issue>
          ),
          <fpage>177</fpage>
          –
          <lpage>183</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Jelier</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.:
          <article-title>Co-occurrence based meta-analysis of scientific texts: retrieving biological relationships between genes</article-title>
          .
          <source>Bioinformatics</source>
          <volume>21</volume>
          (
          <issue>9</issue>
          ),
          <fpage>2049</fpage>
          –
          <lpage>2058</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Kamdar</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iqbal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saleem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deus</surname>
            ,
            <given-names>H.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>GenomeSnip: Fragmenting the Genomic Wheel to augment discovery in cancer research</article-title>
          .
          <source>In: Conference on Semantics in Healthcare and Life Sciences (CSHALS)</source>
          .
          <source>ISCB</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Kamdar</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeginis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasnain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deus</surname>
            ,
            <given-names>H.F.</given-names>
          </string-name>
          :
          <article-title>ReVeaLD: A user-driven domain-specific interactive search platform for biomedical research</article-title>
          .
          <source>Journal of biomedical informatics</source>
          <volume>47</volume>
          ,
          <fpage>112</fpage>
          –
          <lpage>130</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Kosak</surname>
            ,
            <given-names>S.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Groudine</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Form follows function: the genomic organization of cellular differentiation</article-title>
          .
          <source>Genes Dev</source>
          <volume>18</volume>
          ,
          <fpage>1371</fpage>
          –
          <lpage>1384</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Licata</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Briganti</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , et al.:
          <article-title>MINT, the molecular interaction database: 2012 update</article-title>
          .
          <source>Nucleic acids research</source>
          <volume>40</volume>
          (
          <issue>Database issue</issue>
          ),
          <fpage>D857</fpage>
          –
          <lpage>D861</lpage>
          (Jan
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Lieberman-Aiden</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Berkum</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Imakaev</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome</article-title>
          .
          <source>Science</source>
          <volume>326</volume>
          (
          <issue>5950</issue>
          ),
          <fpage>289</fpage>
          –
          <lpage>293</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.P.</given-names>
          </string-name>
          , et al.:
          <article-title>Inferring a protein interaction map of Mycobacterium tuberculosis based on sequences and interologs</article-title>
          .
          <source>BMC Bioinformatics</source>
          <volume>13</volume>
          (
          <issue>Suppl 7</issue>
          ),
          <fpage>S6</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Maali</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peristeras</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Re-using Cool URIs: Entity Reconciliation Against LOD Hubs</article-title>
          . In:
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hausenblas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (eds.)
          <source>LDOW. CEUR Workshop Proceedings</source>
          , vol.
          <volume>813</volume>
          .
          CEUR-WS.org
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Maglott</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ostell</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pruitt</surname>
            ,
            <given-names>K.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tatusova</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Entrez Gene: gene-centered information at NCBI</article-title>
          .
          <source>Nucleic acids research</source>
          <volume>39</volume>
          (
          <issue>Database issue</issue>
          ),
          <fpage>D52</fpage>
          –
          <lpage>7</lpage>
          (Jan
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Michalak</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Coexpression, coregulation, and cofunctionality of neighboring genes in eukaryotic genomes</article-title>
          .
          <source>Genomics</source>
          <volume>91</volume>
          ,
          <fpage>243</fpage>
          –
          <lpage>248</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Obayashi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Okamura</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ito</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tadaka</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motoike</surname>
            ,
            <given-names>I.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kinoshita</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>COXPRESdb: a database of comparative gene coexpression networks of eleven species for mammals</article-title>
          .
          <source>Nucleic Acids Research</source>
          <volume>41</volume>
          (
          <issue>D1</issue>
          ),
          <fpage>D1014</fpage>
          –
          <lpage>D1020</lpage>
          (Jan
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Ruepp</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>CORUM: the comprehensive resource of mammalian protein complexes - 2009</article-title>
          .
          <source>Nucleic Acids Research</source>
          <volume>38</volume>
          (
          <issue>Database-Issue</issue>
          ),
          <fpage>497</fpage>
          –
          <lpage>501</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Ruttenberg</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bug</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , et al.:
          <article-title>Advancing translational research with the Semantic Web</article-title>
          .
          <source>BMC bioinformatics</source>
          <volume>8</volume>
          (
          <issue>Suppl 3</issue>
          ),
          <fpage>S2</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Seal</surname>
            ,
            <given-names>R.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gordon</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lush</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wright</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bruford</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          :
          <article-title>genenames.org: the HGNC resources in 2011</article-title>
          .
          <source>Nucl. Acids Res.</source>
          <volume>39</volume>
          (
          <issue>Suppl 1</issue>
          ),
          <fpage>D514</fpage>
          –
          <lpage>D519</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Sonnhammer</surname>
            ,
            <given-names>E.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eddy</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          , et al.:
          <article-title>Pfam: a comprehensive database of protein domain families based on seed alignments</article-title>
          .
          <source>Proteins</source>
          <volume>28</volume>
          (
          <issue>3</issue>
          ),
          <fpage>405</fpage>
          –
          <lpage>420</lpage>
          (Jul
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Stark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breitkreutz</surname>
            ,
            <given-names>B.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reguly</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , et al.:
          <article-title>BioGRID: a general repository for interaction datasets</article-title>
          .
          <source>Nucleic acids research</source>
          <volume>34</volume>
          (
          <issue>suppl 1</issue>
          ),
          <fpage>D535</fpage>
          –
          <lpage>D539</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aloy</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>3did: interacting protein domains of known three-dimensional structure</article-title>
          .
          <source>Nucleic acids research</source>
          <volume>33</volume>
          (
          <issue>suppl 1</issue>
          ),
          <fpage>D413</fpage>
          –
          <lpage>D417</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Uniprot-Consortium</surname>
          </string-name>
          :
          <article-title>The Universal Protein Resource (UniProt) 2009</article-title>
          .
          <source>Nucleic acids research</source>
          <volume>37</volume>
          (
          <issue>Database issue</issue>
          ),
          <fpage>D169</fpage>
          –
          <lpage>174</lpage>
          (Jan
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>