=Paper=
{{Paper
|id=None
|storemode=property
|title=Network-Based Disease Candidate Gene Prioritization: Towards Global Diffusion in Heterogeneous Association Networks
|pdfUrl=https://ceur-ws.org/Vol-655/dynak2010_paper5.pdf
|volume=Vol-655
|dblpUrl=https://dblp.org/rec/conf/pkdd/GoncalvesMM10
}}
==Network-Based Disease Candidate Gene Prioritization: Towards Global Diffusion in Heterogeneous Association Networks==
Network-Based Disease Candidate Gene
Prioritization: Towards Global Diffusion in
Heterogeneous Association Networks
Joana P. Gonçalves1,2,3,∗ , Sara C. Madeira1,2 , and Yves Moreau3
1
Knowledge Discovery and Bioinformatics group (KDBIO), INESC-ID,
Rua Alves Redol 9, 1000-029 Lisboa, Portugal
2
IST, Technical University of Lisbon,
Av. Rovisco Pais 1, 1049-001 Lisboa, Portugal
3
BIOI, ESAT-SCD, Department of Electrical Engineering, KULeuven,
Kasteelpark Arenberg 10, 3001 Leuven-Heverlee, Belgium
{jpg,smadeira}@kdbio.inesc-id.pt, yves.moreau@esat.kuleuven.be,
Abstract. Disease candidate gene prioritization addresses the associa-
tion of novel genes with disease susceptibility or progression. Network-
based approaches explore the connectivity properties of biological net-
works to compute an association score between candidate and disease-
related genes. Although several methods have been proposed to date,
a number of concerns arise: (i) most networks used rely exclusively on
curated physical interactions, resulting in poor coverage of the Human
genome and leading to sparsity issues; (ii) most methods fail to incorpo-
rate interaction confidence weights; (iii) in some cases, relevance scores
are computed as local measures based on the direct interactions with the
disease-related genes, ignoring potentially relevant indirect interactions.
In this study, we seek a robust network-based strategy by evaluating the
performance of selected prioritization strategies using genes known to be
involved in 29 different diseases.
Keywords: protein-protein interaction, network, random walk, disease
candidate genes, prioritization
1 Introduction
Biomarkers play a crucial role in modern medical practice as a means of im-
proving accuracy in diagnosis, prognosis and treatment. In particular, research
has been actively devising associations of novel genes with disease susceptibility
or progression, relying on high-throughput technologies and the proliferation of
accessible resources of biological data to enable large-scale genome-wide studies.
Most computational methods proposed for disease gene prioritization aim to
identify putative candidates based on their similarity with genes known to be
involved in the occurrence of a particular phenotype, according to: intrinsic prop-
erties, functional annotations, coherent transcriptional responses via expression
data analysis, orthologous relations with genes from model organisms or even
co-occurrence in the literature [22]. Alternative strategies adopt a systemic ap-
proach and explore the topology of biological networks, including protein-protein
interactions, regulatory data or metabolic pathways. These approaches rely on
the assumption that genes co-occurring in a particular network substructure or
interacting tend to participate together in related biological processes to identify
novel genes based on their linkage with the known disease genes [22].
Integrative network-based analysis has been addressed [8,11,15,16,20,23,26],
combining knowledge from distinct resources in association networks to unravel
novel disease genes. However, most of these approaches rely solely on physical in-
teractions [8,23], potentially inferred via orthologous relations with model organ-
isms [26], often resulting in insufficient coverage of the Human genome. Others
include additional interactions predicted from coexpression, pathway, functional
or literature data, but still devise sparse networks [11, 15]. Although the risk for
false positive interactions may rise, the integration of knowledge from heteroge-
neous sources generates denser networks which tend to be less biased toward a
particular evidence, more robust to noise and thus able to perform better in the
prioritization task [16].
Network-based prioritization methods further differ in how they define the
ranking of the candidates from the known disease-related genes. Local measures
are usually computed based on the direct links or shortest paths between the
candidates and the disease-related genes [15, 16], while global strategies diffuse
or smooth a disease-related signal through the network. In this work, we evalu-
ate whether the latter should be preferred over the former, as the inclusion of
indirect associations is able to compensate for missing linkage, ultimately miti-
gating sparsity and “small world” effect issues [20], and global similarities have
recently been shown to outperform local measures [15].
Random walks or diffusion kernels arise as natural candidates for the dif-
fusion approach and their application to prioritization has been proven effec-
tive [6, 8, 15, 23]. Not only they compute fast using iterative methods, even for
large networks [6], they are also able to straightforwardly establish a ranking
of the candidates based on the global connectivity of the network. Neverthe-
less, some of the proposed methods [8, 15] ignore or fail to incorporate weights
expressing the confidence on the evidence of every particular association [16].
Furthermore, their scores are based on the steady-state probability obtained af-
ter a large number of iterations or upon convergence. In this study, we assess the
claim that limited diffusion is usually sufficient for ranking purposes [9, 10] and
on our intuition which leads us to expect the prior knowledge to be somehow
lost or of very little importance to the ranking after diffusing to a large extent.
Throughout this paper, we address the aforementioned topics by analyz-
ing the performance of different prioritization strategies in three case studies:
(i) Integrative heterogeneous protein association network vs integrative protein-
protein physical interaction network (PPPIN); (ii) Global ranking measure vs
local ranking measure; (iii) Confidence weights, degree of diffusion and parame-
ter variation.
2 Methods
A protein-protein association network can be described as a weighted undirected
graph, a special case of a weighted directed graph, defined as G = (V, E), where
V is the set of vertices and E is the set of edges. Each vertex in V and edge
in E correspond to a gene and an association between two genes, respectively.
Let A and D denote the adjacency and diagonal matrices of G, respectively.
Auv is the weight
P w(u, v) of the edge (u, v) between source u and target v.
Also, Duu = (u,v)∈E Auv , ∀u ∈ V , that is, the sum of the weights of the edges
for which u is the source. Prioritizing disease candidates thus formulates as
obtaining a ranking on V given a set S ∈ V of seed genes. For the local scoring
scheme Endeavour’s measure was used [2]. As global network-based strategies,
the PageRank with priors and Heat Diffusion random walks were applied: an
initial signal expressing the relevance of the genes in the context of the disease
in the form of a preference vector, p(0) , is diffused over the network by performing
a limited number of iterations, N .
2.1 Endeavour’s Measure: Intersection of Interactors
Endeavour computes a local network-based measure, whereby the score of each
gene is computed as the overlap between the sets of genes interacting with the
seed genes and those interacting with the candidate gene itself [2]:
X 1 , if ∃z ∈ S : (u, z) ∈ E
Sv = intSeeds(u) intSeeds(u) =
0 , otherwise
(u,v)∈E
2.2 Heat Diffusion and PageRank with Priors
Heat Diffusion is a discrete approximation of the heat kernel [28] first introduced
in [9], in which the rate of diffusion is controlled by a non-negative parameter,
the heat diffusion coefficient t. The iterative equation is given by
(i+1) t t X (i) Auv
pv = 1− · p(i)
v + pu · .
N N Duu
(u,v)∈E
PageRank with priors is an extension of the original PageRank algorithm to
consider the original probability distribution of the scores [25]. A parameter β,
called “back probability” expresses the probability of jumping to the initial node
at each iteration. The iterative equation is
X Auv
p(i+1)
v = β · p(0)
v + (1 − β) · p(i)
u .
Duu
(u,v)∈E
3 Results
Evaluation sudies were performed using Human data from the STRING database
[12] and a PPPIN from Entrez Gene [1] as representatives of protein-protein
heterogeneous association and physical interaction networks, respectively. 620
genes known to be related with 29 diseases were used as prior knowledge to
prioritize candidates in a leave-one-out cross-validation scheme.
3.1 Data and Preprocessing
Networks The STRING database [12, 18] integrates physical interactions and
predicted associations based on knowledge obtained from heterogeneous sources
of transcriptional, functional, metabolic, literature and orthology data. For a fair
comparison with Endeavour, we downloaded and parsed version 7.1 of STRING
[18], including evidences from MINT [7], HPRD [19], BIND [4], DIP [27], Bi-
oGRID [5], KEGG [13] and Reactome [24] databases. Associations from STRING
v8.2 [12] were also retrieved to assess to which extent the additional knowledge
integrated from IntAct [14], PID [21] and GO [3] protein complexes would im-
prove the prioritization performance relative to the previous release. A PPPIN
was downloaded from the NCBI Entrez Gene FTP repository [1]. 130797 Human
interactions were selected from 448534 entries, for which both interactant genes
were tagged with tax ID 9606. From these, 4611, 51275 and 74911 were originally
from BIND [4], BioGRID [5] and HPRD [19], respectively. Genes’ identifiers
followed Entrez Gene nomenclature. Preprocessing of these networks involved
filtering redundant edges and devising an explicit representation of a directed
graph. In the case of the STRING releases, original weights were used to express
the confidence of every association, while in the PPPIN all edges were attributed
weight 1. STRING v7.1 contained 16050 genes and 698534 unique associations.
STRING v8.2 covered 17448 Human genes with 1256016 non-redundant associa-
tions. Finally, the PPPIN had 47873 physical interactions between 10175 genes.
Seed sets 620 disease genes were selected from the OMIM [17] database span-
ning 29 disease-specific sets, with an average of 21 genes per set. As genes were
identified according to Ensembl nomenclature, the seeds could be directly used
with STRING. For the PPPIN, however, we performed a conversion between En-
sembl and Entrez Gene identifiers. A mapping was parsed from a file downloaded
from the NCBI Entrez Gene FTP repository [1] and used to generate the corre-
sponding seed sets using Entrez Gene names. Additionally, we filtered the genes
absent from at least one of the networks or for which the conversion between
Ensembl and Entrez did not succeed. In total, 94 seed genes were lost (14 with
no conversion, 80 absent from the PPPIN). A single occurrence of a gene with
several Entrez aliases happened. In this case, only the alias present in the PPPIN
was kept. For validation purposes, seed sets containing randomly selected genes
were generated. The number of seeds in each set was randomly chosen in the
range [5, 100] and the genes were randomly selected from the Human STRING
v8.2 network. 546 genes were retrieved.
3.2 Evaluation Measures and Experimental Setting
Evaluation measures Ideally, in a leave-one-out cross-validation scheme, we
would expect the prioritization strategy to rank the left-out gene known to be
related with the disease at the top. Under this assumption, we assess the perfor-
mance of the scoring methods overall and per disease based on four evaluation
measures: the number of left-out genes ranked in the top 10 and 20 positions,
the Area under the ROC curve (AUC) score, and the mean average precision.
For a given combination of diffusion parameter α and number of iterations
N , n rankings are generated (one per left-out gene). The AUC score is given by
(N )
Pn rk
n− k=1 m(N )
k
SAU Cα,N = ,
n
(N )
where rk is the ranking position of the k th left-out gene in the k th ranked list
(N )
and mk is the number of ranked genes in the k th list.
Mean average precision (MAP) is an evaluation measure that combines preci-
sion and recall. Essentially, MAP averages the precisions computed by truncating
the list after each of the relevant entities is found. Only one relevant entity must
be found, the left-out gene. Thus, precision at rank r is either 0, before it has
been found, or 1r . Moreover, in our setting the ranked lists contain equal number
of genes, allowing us to simplify our MAP score for n lists with the same size to:
Pn 1
k=1 r (N )
k
SM APα,N =
n
Experimental setting In each validation run, one different gene was deleted
from the set of seed genes and added to 99 randomly selected candidate genes.
A ranking method was then applied to compute a score for every gene in the
network. Finally, the ranking of the 100 candidate genes was defined according
to the retrieved scores. In the case of the Heat Diffusion, the scores of the seed
genes were initialized to 1. For PageRank, an initial seed score of 1/|S| was used.
Performance was assessed by computing AUC and MAP scores, and counting the
number of left-out genes ranked in the top 10 and top 20 positions, both overall
and per disease. We sought the best performance of each method using several
combinations of parameters. Heat diffusion coefficients t and back probabilities β
of 0.1, 0.3, 0.5, 0.7 and 0.9 with 2, 5, 10, 15 and 20 iterations using STRING and
2, 5, 10, 20, 100 iterations using the PPPIN were tried. In the case studies, results
are shown only for the parameter settings which achieved the best performance
in each case. We further ranked the randomly generated seed sets using the leave-
one-out cross-validation in STRING v8.2 to assess whether the Heat Diffusion
method was able to take advantage of the information contained in the seed sets
to improve the identification of the left-out seeds. Overall, AUC and MAP scores
of 0.501 and 0.05 were achieved and only 57 and 92 genes were ranked in the top
10 and 20 positions. Similar results were obtained per seed set (data not shown),
in accordance with what would have been expected for random seed sets.
3.3 Case Studies
Heat Diffusion and PageRank with priors achieved similar results in both net-
works (Table 1). For this reason, we abstain ourselves of comparing the results
of both random walks, considering the results equivalent when applied to the
same network. Throughout this section, we will always refer to one of them as
a representative of a global measure. A brief description of the prioritization
performances obtained for each case study follows.
Method Network Parameters AUC MAP TOP 10 TOP 20 #BRM #BRN
HeatDiffusion STRING8 t = 0.3, N = 10 0.962 0.711 484 502 26% 68%
PageRank STRING8 β = 0.7, N = 2 0.961 0.693 485 502 20% 69%
HeatDiffusion PPPIN t = 0.5, N = 2 0.862 0.352 301 373 40% 11%
PageRank PPPIN β = 0.5, N = 2 0.861 0.349 304 384 38% 10%
Table 1. Results of Heat Diffusion and PageRank using both STRING v8.2 and the
PPPIN. ’#BRM’ (better ranked by method in each network) shows the percentage of
genes with a higher rank in a one-to-one comparison of the ranks per gene for both
methods in each network. ’#BRN’ (better ranked by network for each method) shows
the percentage of genes with a higher rank in a one-to-one comparison of the ranks per
gene for both networks using the same method. Total number of genes: 526.
Global measure vs Local measure A network-based global ranking was
obtained using the Heat Diffusion method with t = 0.3, N = 10, while Endeavour
[2] was used to score the genes using its local measure. Both rankings were
based on STRING v7.1, the version included in Endeavour. Overall, the random
walk global measure outperformed the local interaction overlap in all evaluation
measures (see Table 2), that is, the higher number of left-out genes was ranked
on the top positions, also achieving better ranks in general, using the latter.
Method Network AUC MAP TOP 10 TOP 20
HeatDiffusion (t = 0.3, N = 10) STRING v7.1 0.942 0.643 536 569
Endeavour STRING v7.1 0.806 0.326 393 464
Table 2. Overall results of Heat Diffusion and Endeavour using STRING v7.1. Total
number of genes: 620.
Regarding the AUC scores per disease (see Table 3), the Heat Diffusion
method outperformed Endeavour in all diseases except Ehlers-Danlos syndrome
(0.944 opposed to 0.948, respectively). This was also the only disease for which
the number of genes ranked in the top 20 positions was higher using the local
measure (Endeavour was able to rank one more gene in the top 20). However,
the MAP score was better for the Heat Diffusion method and, in fact, 9 of the 10
seed genes ranked in the top 10 positions by both methods scored higher using
the global measure.
For the remaining diseases, Heat Diffusion was always able to rank the same
or a higher number of genes in both the top 10 and the top 20 positions. Regard-
ing the MAP scores, Heat Diffusion outperformed Endeavour in every disease
and was able to rank all genes of both amyotrophic lateral sclerosis and Usher
syndrome in the first position.
Heat Diffusion Endeavour
STRING v7.1 STRING v7.1
Disease #Genes AUC MAP Top10 Top20 AUC MAP Top10 Top20
Alzheimer’s disease 8 0.934 0.586 7 7 0.930 0.376 6 7
amyotrophic lateral sclerosis 4 0.990 1.000 4 4 0.975 0.550 4 4
anemia 44 0.928 0.499 36 40 0.718 0.187 21 30
breast cancer 24 0.930 0.608 21 22 0.782 0.214 13 19
cardiomyopathy 22 0.973 0.812 21 21 0.862 0.579 18 18
cataract 20 0.890 0.693 15 16 0.883 0.363 13 17
Charcot-Marie-Tooth disease 14 0.889 0.752 12 12 0.738 0.361 8 8
colorectal cancer 21 0.961 0.697 19 20 0.918 0.389 17 20
deafness 42 0.941 0.642 37 40 0.732 0.186 17 25
diabetes 26 0.967 0.731 22 26 0.820 0.232 17 21
dystonia 5 0.986 0.867 5 5 0.938 0.381 4 5
Ehlers-Danlos syndrome 10 0.944 0.650 9 9 0.948 0.296 9 10
emolytic anemia 13 0.965 0.683 12 12 0.737 0.269 8 8
epilepsy 15 0.989 0.933 15 15 0.749 0.612 10 10
ichthyosis 9 0.881 0.598 8 8 0.778 0.226 6 6
leukemia 112 0.922 0.428 88 100 0.807 0.203 68 86
lymphoma 31 0.920 0.420 24 25 0.796 0.275 19 22
mental retardation 24 0.918 0.629 21 21 0.624 0.110 7 11
muscular dystrophy 24 0.981 0.780 24 24 0.869 0.390 19 21
myopathy 41 0.961 0.594 37 39 0.885 0.535 34 34
neuropathy 18 0.965 0.671 14 17 0.648 0.205 8 9
obesity 13 0.931 0.796 12 12 0.918 0.559 12 12
Parkinson’s disease 9 0.903 0.728 7 7 0.661 0.158 4 4
retinitis pigmentosa 30 0.957 0.882 27 28 0.845 0.470 22 23
spastic paraplegia 7 0.930 0.860 6 6 0.927 0.586 5 6
spinocerebellar ataxia 7 0.959 0.863 6 6 0.816 0.250 3 6
Usher syndrome 8 0.990 1.000 8 8 0.988 0.917 8 8
xeroderma pigmentosum 10 0.987 0.850 10 10 0.785 0.704 7 7
Zellweger syndrome 9 0.989 0.944 9 9 0.823 0.513 6 7
Table 3. Results of the Heat Diffusion (t = 0.3, 10 iterations) and Endeavour methods
using STRING v7.1, per disease. Total number of genes: 620.
Protein-Protein Associations vs Protein-Protein Physical Interactions
Heat Diffusion achieved better performance using STRING v8.2, with AUC score
0.962, opposed to 0.862 using the PPPIN (see Table 1). Furthermore, STRING
enabled to rank more than 90% of the genes in the top 10 positions, while using
the PPPIN less than 60% were in top 10. In a one-to-one comparison, Heat
Diffusion ranked 68% of the genes better using STRING, while only 11% of the
ranks were better using the PPPIN. Table 4 compares the results obtained for
the Heat Diffusion method using STRING v8.2 with PageRank with priors in a
PPPIN, one of the best performing strategies in [8], per disease.
Heat Diffusion PageRank
STRING v8.2 PPPIN
Disease #Genes AUC MAP Top10 Top20 AUC MAP Top10 Top20
Alzheimer’s disease 8 0.929 0.877 7 7 0.668 0.456 5 5
amyotrophic lateral sclerosis 4 0.990 1.000 4 4 0.530 0.028 0 1
anemia 37 0.967 0.599 35 36 0.679 0.268 15 19
breast cancer 22 0.952 0.618 20 20 0.877 0.427 17 18
cardiomyopathy 19 0.986 0.904 19 19 0.789 0.383 12 13
cataract 16 0.980 0.781 16 16 0.751 0.485 10 11
Charcot-Marie-Tooth disease 10 0.934 0.735 9 9 0.665 0.251 3 3
colorectal cancer 29 0.969 0.785 19 19 0.912 0.382 15 19
deafness 28 0.950 0.623 23 27 0.547 0.210 7 8
diabetes 25 0.966 0.743 23 24 0.838 0.422 17 20
dystonia 5 0.986 0.800 5 5 0.700 0.316 2 2
Ehlers-Danlos syndrome 8 0.990 1.000 8 8 0.850 0.613 6 7
emolytic anemia 12 0.978 0.772 12 12 0.793 0.149 4 6
epilepsy 13 0.989 0.962 13 13 0.803 0.454 8 8
ichthyosis 7 0.954 0.768 6 6 0.651 0.367 3 3
leukemia 98 0.948 0.520 86 93 0.811 0.209 50 67
lymphoma 26 0.930 0.476 21 22 0.850 0.270 15 18
mental retardation 19 0.926 0.727 16 17 0.739 0.303 8 12
muscular dystrophy 20 0.983 0.790 20 20 0.893 0.524 15 15
myopathy 35 0.969 0.702 33 35 0.731 0.272 20 24
neuropathy 17 0.951 0.699 15 15 0.636 0.201 5 8
obesity 12 0.988 0.917 12 12 0.892 0.621 10 10
Parkinson’s disease 8 0.935 0.878 7 7 0.754 0.465 5 5
retinitis pigmentosa 23 0.981 0.883 22 23 0.736 0.310 11 12
spastic paraplegia 5 0.990 1.000 5 5 0.490 0.083 1 1
spinocerebellar ataxia 7 0.957 0.768 6 6 0.726 0.095 3 4
Usher syndrome 4 0.990 1.000 4 4 0.880 0.631 3 3
xeroderma pigmentosum 10 0.988 0.900 10 10 0.980 0.811 10 10
Zellweger syndrome 8 0.990 1.000 8 8 0.871 0.814 7 7
Table 4. Heat Diffusion using STRING v8.2 (t = 0.3, N = 10) vs PageRank with
priors using the PPPIN (β = 0.5, N = 2), per disease. Total number of genes: 526.
Regarding the disease-specific scores (see Table 4), the lowest AUC (and
MAP) values for the combination Heat Diffusion and STRING v8.2 were of
0.926 (0.727) for mental retardation, and 0.930 (0.476) for lymphoma, which are
still good results. For five diseases, namely amyotrophic lateral sclerosis, Ehlers-
Danlos syndrome, spastic paraplegia, Usher syndrome and Zellweger syndrome,
the heterogeneous association network approach was actually able to rank all the
seed genes in the first position of the ranking. On the other hand, the PageRank
diffusion in the PPPIN achieved AUC scores above 0.9 only for two diseases:
colorectal cancer with 0.912 and xeroderma pigmentosum with 0.98. The lowest
AUC and MAP scores were obtained for amyotrophic lateral sclerosis (0.53 and
0.028) and spastic paraplegia (0.49 and 0.083). The PPPIN strategy could not
rank any of the seed genes for amyotrophic lateral sclerosis in the top 10 positions
and only one was identified in the first 20. Also, only one gene out of the 5 seeds
for spastic paraplegia was ranked in the top 10/20. In this case, the performance
for both diseases is comparable to the one obtained using the random seed sets
(data now shown).
Confidence weights, number of iterations and diffusion rate We assessed
the contribution of STRING’s weights expressing the degree of confidence in the
associations between genes to the performance of the prioritization method by
diffusing the initial preference vector using the filtered disease-specific seed sets
on the network after setting all associations’ weights to 1. Although the resulting
AUC and MAP scores (0.957 and 0.662) were not substantially different from
the ones obtained using the confidence weights (0.962 ans 0.711), they actually
reflected in less 9 genes ranked in the top 10 (data not shown). Overall, the
number of genes in the top 20 was the same, with slight variations per disease.
From the five diseases achieving maximum performance in the differentially as-
sociation weighted setting, only for Ehlers-Danlos syndrome, spastic paraplegia
and Zellweger syndrome these results could be maintained.
In both random walk approaches, the best results were achieved using a
limited number of iterations. STRING v8.2 provided consistent and stable per-
formance when varying the number of diffusion steps. On the PPPIN, the best
ranking was always obtained using two iterations. It would then stabilize for
larger numbers of steps, although measuring considerably lower in the evalua-
tion, since it was never able to rank more than 289 or 346 genes - out of 526 -
in the top 10 and top 20, respectively.
Regarding the parameter controlling the rate of diffusion, the Heat Diffusion
method delivered quite similar performance for the set of heat coefficients tried:
in STRING v8.2, resulting in AUC scores ranging from 0.960 to 0.962 for each
diffusion coefficient, considering equal number of iterations; in the PPPIN, AUC
scores ranging between 0.859 and 0.862 with 2 iterations, N = 2, and between
0.766 and 0.771 using 5, 10, 20 and 100 iterations.These results indicate its ro-
bustness to variations in this parameter. For PageRank with priors, the impact
of the back probability value was not neglegible. For the lowest back probabili-
ties (0.01 and 0.05) the scores were unstable leading to considerable performance
variations, even using STRING v8.2. For β = {0.1, 0.3, 0.5, 0.7, 0.9}, the PageR-
ank AUC scores in STRING v8.2 varied between 0.936 and 0.961 considering the
results obtained using the same number of iterations. In the PPPIN, PageRank
obtained AUC scores between 0.859 and 0.861 using 2 iterations and ranging
between 0.758 and 0.775 using 5, 10, 20 and 100 iterations.
4 Conclusions
Prioritization results confirmed our hypothesis that networks integrating gene
associations retrieved or predicted using data from heterogeneous sources should
be in general more informative and potentially able to perform better in the
identification of genes associated with a given disease when compared to networks
containing only physical interactions. Advantages of the former are supported by
three key observations: (1) associations derived from the combination of several
types of evidence should be more reliable and accurate; (2) heterogeneous data
integration enables a better coverage of the genome and larger network density,
confering robustness to noise; (3) confidence weights can be devised in order to
differentiate associations and mitigate the impact of false positive associations,
particularly when based on a limited number of sources.
Nevertheless, our analysis shows that heterogeneous association networks do
not present sufficient guarantee for maximum performance by themselves. In fact,
the network-based score measuring the degree of relatedness of each candidate
gene with a given disease based on a set of known disease-related genes proved
to play a major role. Essentially, based on the results we could conclude that in
comparison to neighborhood-limited scores a network-based measure able to cap-
ture global connectivity properties by considering indirect associations between
genes is not only (1) more robust, as it compensates for the sparsity related to
direct associations and tackles the “small world” effect issue; but also (2) more
informative, deriving a score based on a systemic view of the interactome. This
claim has also been previously hinted at in [15, 16].
Propagation schemes tested in the computation of global network-based scores
diffused an initial preference vector expressing the distribution of the known
disease-related genes through the network using random walks. These methods
compute fast using iterative procedures, even for large networks. Furthermore,
we could verify that in the context of prioritization in association or physical
interaction networks the maximum performance can be achieved using only a
limited number of iterations. Heat Diffusion and PageRank with priors delivered
high quality results and achieved similar performance under appropriate param-
eter settings, supporting the claim of equivalence [8, 25] for other approaches of
the same kind, namely HITS with priors and K-Step Markov. The importance of
confidence weights was inconclusive, as the difference in performance exhibited
by our experiments was residual. We believe, however, that appropriate associ-
ation confidence weights may improve accuracy of network-based prioritization
results.
Acknowledgments This work was partially supported by FCT (INESC-ID
multiannual funding) through the PIDDAC Program funds. JPG is the recipient
of a doctoral grant supported by FCT (SFRH/BD/36586/2007).
References
1. NCBI Entrez Gene FTP Repository (Jan 2010), ftp://ftp.ncbi.nih.gov/gene/
2. Aerts, S., Van Loo, P., De Smet, F., Lambrechts, D., Maity, S., Tranchevent, L.C.,
De Moor, B., Coessens, B., Marynen, P., Hassan, B., Carmeliet, P., Moreau, Y.:
Gene prioritization through genomic data fusion. Nature Biotechnology 24(5), 537–
44 (2006)
3. Ashburner, M., Ball, C., Blake, J., Botstein, D., Butler, H., Cherry, J., Davis, A.,
Dolinski, K., Dwight, S., Eppig, J., Others: Gene Ontology: tool for the unification
of biology. Nature Genetics 25(1), 2529 (2000)
4. Bader, G.D., Betel, D., Hogue, C.W.V.: BIND: the Biomolecular Interaction Net-
work Database. Nucleic Acids Research 31(1), 248–250 (2003)
5. Breitkreutz, B.J., Stark, C., Reguly, T., Boucher, L., Breitkreutz, A., Livstone, M.,
Oughtred, R., Lackner, D.H., Bahler, J., Wood, V., Dolinski, K., Tyers, M.: The
BioGRID Interaction Database: 2008 update. Nucleic Acids Research 36(suppl 1),
D637–640 (2008)
6. Can, T., Çamoğlu, O., Singh, A.K.: Analysis of protein-protein interaction net-
works using random walks. In: Proceedings of the 5th International Workshop on
Bioinformatics - BIOKDD ’05. p. 61. ACM Press, New York, New York, USA
(2005)
7. Chatr-Aryamontri, A., Zanzoni, A., Ceol, A., Cesareni, G.: Searching the protein
interaction space through the MINT Database. Methods in Molecular Biology 484,
305–317 (2008)
8. Chen, J., Aronow, B.J., Jegga, A.G.: Disease candidate gene identification and pri-
oritization using protein interaction networks. BMC Bioinformatics 10, 73 (2009)
9. Chung, F., Yau, S.: Coverings, heat kernels and spanning trees. Electronic Journal
of Combinatorics 6, R12 (1999)
10. Francisco, A.P., Gonçalves, J.P., Madeira, S.C., Oliveira, A.L.: Using personalized
ranking to unravel relevant regulations in the Saccharomyces cerevisiae regulatory
network. In: Jornadas de Bioinformática 2009. Lisbon, Portugal (2009)
11. Franke, L., Bakel, H., Fokkens, L., Jong, D., E.d, Egmont-petersen, M., Wijmenga,
C.: Reconstruction of a functional human gene network, with an application for
prioritizing positional candidate genes. The American Journal of Human Genetics
78, 1011–1025 (2006)
12. Jensen, L.J., Kuhn, M., Stark, M., Chaffron, S., Creevey, C., Muller, J., Doerks,
T., Julien, P., Roth, A., Simonovic, M., Bork, P., von Mering, C.: STRING 8–a
global view on proteins and their functional interactions in 630 organisms. Nucleic
acids research 37(Database issue), D412–6 (2009)
13. Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M.: The KEGG re-
source for deciphering the genome. Nucleic Acids Research 32(suppl 1), D277–280
(2004)
14. Kerrien, S., Alam-Faruque, Y., Aranda, B., Bancarz, I., Bridge, A., Derow, C.,
Dimmer, E., Feuermann, M., Friedrichsen, A., Huntley, R., Kohler, C., Khadake,
J., Leroy, C., Liban, A., Lieftink, C., Montecchi-Palazzi, L., Orchard, S., Risse, J.,
Robbe, K., Roechert, B., Thorneycroft, D., Zhang, Y., Apweiler, R., Hermjakob, H.:
IntAct–open source resource for molecular interaction data. Nucleic Acids Research
35(suppl 1), D561–565 (2007)
15. Köhler, S., Bauer, S., Horn, D., Robinson, P.: Walking the interactome for pri-
oritization of candidate disease genes. The American Journal of Human Genetics
82(4), 949958 (2008)
16. Linghu, B., Snitkin, E.S., Hu, Z., Xia, Y., Delisi, C.: Genome-wide prioritization of
disease genes and identification of disease-disease associations from an integrated
human functional linkage network. Genome biology 10(9), R91 (2009)
17. McKusick, V.A.: Mendelian Inheritance in Man and Its Online Version, OMIM.
The American Journal of Human Genetics 80(4), 588–604 (2007)
18. von Mering, C., Jensen, L.J., Kuhn, M., Chaffron, S., Doerks, T., Krüger, B., Snel,
B., Bork, P.: STRING 7–recent developments in the integration and prediction of
protein interactions. Nucleic Acids Research 35(Database issue), D358–62 (2007)
19. Mishra, G.R., Suresh, M., Kumaran, K., Kannabiran, N., Suresh, S., Bala, P.,
Shivakumar, K., Anuradha, N., Reddy, R., Raghavan, T.M., Menon, S., Hanu-
manthu, G., Gupta, M., Upendran, S., Gupta, S., Mahesh, M., Jacob, B., Mathew,
P., Chatterjee, P., Arun, K.S., Sharma, S., Chandrika, K.N., Deshpande, N., Pal-
vankar, K., Raghavnath, R., Krishnakanth, R., Karathia, H., Rekha, B., Nayak, R.,
Vishnupriya, G., Kumar, H.G.M., Nagini, M., Kumar, G.S.S., Jose, R., Deepthi, P.,
Mohan, S.S., Gandhi, T.K.B., Harsha, H.C., Deshpande, K.S., Sarker, M., Prasad,
T.S.K., Pandey, A.: Human protein reference database–2006 update. Nucleic Acids
Research 34(suppl 1), D411–414 (2006)
20. Nitsch, D., Tranchevent, L.C., Thienpont, B., Thorrez, L., Van Esch, H., Devriendt,
K., Moreau, Y.: Network analysis of differential expression for the identification of
disease-causing genes. PloS ONE 4(5), e5526 (2009)
21. Schaefer, C.F., Anthony, K., Krupa, S., Buchoff, J., Day, M., Hannay, T., Buetow,
K.H.: PID: the Pathway Interaction Database. Nucleic Acids Research 37(suppl 1),
D674–679 (2009)
22. Tiffin, N., Andrade-Navarro, M.A., Perez-Iratxeta, C.: Linking genes to diseases:
it’s all in the data. Genome Medicine 1(8), 77 (Jan 2009)
23. Vanunu, O., Magger, O., Ruppin, E., Shlomi, T., Sharan, R.: Associating Genes and
Protein Complexes with Disease via Network Propagation. PLoS Computational
Biology 6(1) (2010)
24. Vastrik, I., D’Eustachio, P., Schmidt, E., Joshi-Tope, G., Gopinath, G., Croft, D.,
de Bono, B., Gillespie, M., Jassal, B., Lewis, S., Matthews, L., Wu, G., Birney, E.,
Stein, L.: Reactome: a knowledge base of biologic pathways and processes. Genome
Biology 8(3), R39 (2007)
25. White, S., Smyth, P.: Algorithms for estimating relative importance in networks.
In: KDD ’03: Proceedings of the Ninth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining. pp. 266–275. ACM, New York, NY,
USA (2003)
26. Wu, X., Jiang, R., Zhang, M.Q., Li, S.: Network-based global inference of human
disease genes. Molecular Systems Biology 4(189), 189 (2008)
27. Xenarios, I., Rice, D.W., Salwinski, L., Baron, M.K., Marcotte, E.M., Eisenberg,
D.: DIP: the Database of Interacting Proteins. Nucleic Acids Research 28(1), 289–
291 (2000)
28. Yang, H., King, I., Lyu, M.: Diffusionrank: a possible penicillin for web spam-
ming. In: Proceedings of the 30th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval. p. 438. ACM (2007)