-

Exploiting Semantics to Predict Potential Novel Links from Dense Subgraphs

Alejandro Flores

Maria-Esther Vidal

Guillermo Palma

gpalmag@ldc.usb.ve 0 0 Universidad Simo ́ n Bol ́ıvar , Caracas , Venezuela

Knowledge graphs encode semantic knowledge that can be exploited to enhance different data management tasks, e.g., query answering, ranking, or data mining. We tackle the problem of predicting interactions between drugs and targets, and propose esDSG, an unsupervised approach able to predict links from subgraphs that are not only highly dense, but that comprise both similar drugs and targets. The esDSG approach extends a state-of-the-art approximate densest subgraph algorithm with knowledge about the semantic similarity of the nodes in the original graph, and then predicts potential novel interactions from the computed dense subgraph. We have conducted an initial experimental study on a benchmark of drug-target interactions. Our observed results suggest that esDSG is able to identify interactions in graphs where existing approaches cannot perform equality well. Further, a large number of esDSG predictions can be validated using external databases as STITCH1 and Kegg2. These results, although initial, reveal how semantics in conjunction with topological information of the knowledge graph may have a great impact on pattern discovery tasks.

Knowledge graphs are networks of semantically related concepts, described in terms of attributes, types, and relationships. For example, biomedical knowledge graphs represent relationships among drugs, targets, side effects, and so on. Exploring these semantically rich networks can lead to novel discoveries [ 3 ], e.g., interactions between drugs and targets, or drug side effects. We focus on the problem of suggesting novel associations between concepts from knowledge graphs where concepts of the same type are characterized by similarity measures. We apply our approach to predict drug-target interactions; however, the approach is general enough to be used in other domains.

Different approaches have been proposed to predict interactions in networks [ 1, 4, 5, 10, 14, 15 ]; these approaches have been applied to drug-target graphs, where edges between drugs and targets represent drug-target interactions, and drug-drug and targettarget similarity functions are defined. The prediction hypothesis is that similar drugs interact with the same targets, and similar targets interact with the same drugs [ 4 ]. The main challenge addressed by existing approaches is the development of strategies that allow for detecting portions of the network where novel predictions can be suggested. To illustrate the problem of suggesting drug-target interactions, consider Figures 1(a),

1 http://stitch.embl.de/ 2 http://www.genome.jp/kegg/

(b) Dense Subgraph (a) Drug-target interactions (c) Edge-Similarity Dense Subgraph (b), and (c). Figure 1(a) visualizes the network of drug-target interactions of Ion Channels in the dataset studied by Ding et al. [ 3 ]; drug-drug and target-target relationships are not drawn. To avoid the pair-wise comparison of all the drugs and targets in the graph, existing approaches limit the search to dense portions of the graph that comprise both similar drugs and similar targets. For example, the semEP [ 10 ] approach represents the network of drug-target interactions as a bipartite graph, and partitions the edges (interactions) of the graph according to a density function that measures how similar are the drugs and targets that are placed in one partition. Further, the number of interactions between drugs and targets in each partition are taken into account by this measure. semEP is able to explore portions of the search space and suggest predictions that other state-of-the-art approaches do not discover. However, in graphs such as the one reported in Figure 1(a), semEP and state-of-the-art approaches do not exhibit as good behavior as in other networks. The Ion Channel graph is composed of 414 nodes and 1,476 edges. It also has three connected components; two of these components are comprised of five nodes and three edges. The rest of the nodes are part on the main connected component, from which the dense subgraph of the original network can be computed; this dense subgraph is presented in Figure 1(b). We hypothesize that in graphs with this topology, limiting the search to highly dense components could lead to better predictions. However, dissimilar drugs or targets may comprise these highly connected components. We address the problem of identifying subgraphs that are not only dense, but are composed of highly similar drugs and targets. We name these graphs edge-similarity dense subgraphs; Figure 1(c) shows an edge-similarity dense subgraph for the Ion Channel dataset. Evaluating the prediction hypothesis in this subgraph allows for suggesting seven predictions shown in red-dashed lines.

We propose a novel technique named esDSG, that is able to produce edge-similarity dense subgraphs and to suggest novel associations from these subgraphs. We have conducted an experimental study on the benchmark of drug-target interactions studied by Ding et al. [ 3 ]. Interactions suggested by semEP and five state-of-the-art machine learning based techniques are compared to the interactions produced by esDSG. The observed results suggest that the search spaces explored by all these approaches are different. Further, esDSG identifies interactions in graphs where existing approaches cannot perform equality well. More importantly, a large number of the esDSG predictions can be validated using the external databases STITCH and Kegg. These results suggest that considering the topology in conjunction with semantics encoded in the graph, e.g., similarity values, can have a positive impact on the effectiveness of discovery tasks.

The rest of the paper is as follows: Section 2 describes the esDSG approach and Section 3 summarizes related approaches. Experimental results are presented in Section 4. Finally, we conclude in Section 5 with an outlook to future work. 2

Edge-Similarity Densest Subgraph Problem

Our goal is to discover novel drug-target interactions by exploiting relationships among existing interactions and values of similarity between both drugs and targets. However, to effectively suggest such interactions, the prediction hypothesis should be evaluated in graphs that maximize: i) graph density, and ii) drug-drug and target-target similarities. We cast the problem of generating these graphs into a single-objective optimization problem, called the edge-similarity Densest Subgraph (esDSG) problem.

Let D = fd1; d2; : : : ; dng be a set of drugs, and let sd(di; dj ) 2 [0 : : : 1] be the similarity score function defined between every two drugs in D. Similarly, let T = ft1; t2; : : : ; tmg be a set of targets, and let st(ti; tj ) 2 [0 : : : 1] be the similarity score function defined between every two targets in T . Given the sets T and D, we call E the set of interactions between targets and drugs, where E T D. These interactions can be naturally represented as a bipartite graph G = (T; D; E).

For each node v in the subgraph, the goal is to maximize not only the number of interactions (i.e., the degree), but also the average of the pair-wise similarity values of v with respect to the rest of the nodes in the subgraph. To do this, we define the similarity measure of a single node v as follows: Definition 1 (Single-node similarity). Consider a bipartite graph G = (T; D; E) of drug-target interactions. The single-node similarity of v 2 T (resp., v 2 D) corresponds to the arithmetic mean of the similarity measure values of v with every node v0 2 T (resp., v0 2 D). This is denoted as sn(v).

sn(v) = 8 1 P st(v; t) >< jT j t2T

1 P sd(v; d) >: jDj d2D if v 2 T if v 2 D (1) is dissimilar to the rest of the drugs in the subgraph. This is represented with low values of the single-node similarity, i.e., sn(D00332) = 31 1 + 0:16 + 0:17 . On the other hand, the drug D05077 is highly similar to the rest of the drugs in this subgraph; thus, a higher value of single-node similarity is assigned to D05077, i.e., sn(D05077) = 13 1 + 0:97 + 0:17 . Note that single-node similarity values depend on the context in which they are computed, i.e., only the nodes in a particular subgraph are considered. (a) (b)

We can characterize edges in a graph based on the single-node similarity values of the nodes that connect each edge. For example, because D00332 is less similar than D05077 to the rest of the drugs in the subgraph of Figure 2(a), the interaction that relates D00332 with the target SCN5A is not as important in terms of similarity as the interaction between D05077 the same target SCN5A. We model the importance of an interaction in terms of similarity as a function named edge similarity, that averages the single node similarities of the target and the drug that are related with this interaction. Definition 2 (Edge similarity). Consider a bipartite graph G = (T; D; E) of drugtarget interactions, and the single-node similarity value of every drug and target in the graph. The edge similarity score of an interaction (t; d) 2 E, denoted as se(t; d), corresponds to the arithmetic mean of the single-node similarity values of t and d. se(t; d) =

Based on this definition, the edge similarity of the interaction between D00332 and SCN5A is 0.63, while the edge similarity of the interaction between D05077 and SCN5A is 0.76. These values of the edge similarity function indicate the importance of these two interactions in terms of the similarity of the drugs and targets that they relate, with respect to the rest of the drugs and targets in the subgraph.

Normally, the definition of density for a bipartite graph, denoted as D(G), is given jEj . Using this definition, each edge of the graph is equally relevant by D(G) = pjT jjDj in terms of the density of the graph. Our approach relies on the importance of edges in terms of similarity, i.e., the edge similarity, to compute a similarity-based density. Definition 3 (Edge-Similarity Density). Consider a bipartite graph G = (T; D; E), where T represents the set of targets, D the set of drugs, and E the set of drug-target interactions. Let se(t; d), for (t; d) 2 E, be the edge similarity value for every edge in the graph. The Edge-Similarity Density of G can be defined as: (3) (4) Des(G) =

P (t;d)2E

se(t; d) pjT jjDj

Using the edge similarity values computed for every edge in the subgraph in Figure 2(a), the edge-similarity density is equal to 1.19, while the graph density for this subgraph is 1.63. Removing the drug D00332 from this subgraph results in the subgraph G0 in Figure 2(b) with a decreased density value of 1.50. However, the edge-similarity density in this new graph is higher, i.e., Des(G0) = 1:35. Applying the prediction hypothesis to this subgraph would result in the prediction of an interaction between the target SCN2A and the drug D05077; this interaction can be validated in Kegg.

Based on the definition of the edge-similarity density, we define the Edge-Similarity Densest Subgraph problem (esDSG) as follows:

Definition 4 (Edge-Similarity Densest Subgraph problem (esDSG)). Given a bipar

tite graph G = (T; D; E) as described above, representing the set of interactions between targets and drugs, esDSG identifies the subgraph G G such that the edgesimilarity density of G , Des(G ), is maximized. 2.1

A greedy approximate algorithm for esDSG

We propose a greedy approximate algorithm to the esDSG problem. The resulting subgraphs are both highly dense and similar, and they will be used to predict novel interactions between targets and drugs. The edge-similarity density is used to guide the greedy algorithm into spaces of dense and similar subgraphs.

The greedy decision of removing the nodes that contribute the least to the edgesimilarity density may result in higher values of the measure. For example, as shown in Figure 2, removing the drug D00332 with both minimum single-node similarity and degree values, as well as removing the interaction between this drug and the target SCN5A, leads to an increase of the edge-similarity density from 1.19 to 1.35.

After applying the definition of edge similarity and some algebraic manipulation, the edge-similarity density can be reformulated as a function of the degree values and the single-node similarity of each node in a graph G, as follows:

Des(G) = v2T [D

P degree(v) sn(v) 2pjT jjDj

We propose a modification of the densest subgraph 2-approximation algorithm proposed by Khuller and Barna [ 7 ]. In our algorithm, the greedy decision will be made not only in terms of the degree, but also the single-node similarity of the nodes of the graph. Along with the use of the edge-similarity density, this constitutes the most important difference with Khuller & Barna’s algorithm, where the greedy strategy indicates the removal of nodes, according only to the degree.

Algorithm 1: Edge-Similarity Dense Subgraph

Input: A bipartite graph G = (T; D; E)

Output: A subgraph H G 1 i jT j + jDj; Hi G 2 while Hi 6= ? do 3 Let v be a vertex with minimum value degree(v) sn(v) in Hi 4 Delete all incident edges in v 5 Delete all nodes with no incident edges 6 Update single-node similarity values of all nodes 7 Call the new graph Hi 1, i i 1 8 return H , a graph with maximum edge-similarity density, Des(H ), from all the Hi’s

Algorithm 1 performs an ordered removal of the nodes of a bipartite graph, removing a node v with minimum product of the single-node similarity and degree values, i.e., a node such that degree(v) sn(v) is the minimum value. Further, the incident edges on v are removed from the graph. The algorithm iteratively performs this greedy decision until no more edges can be removed. In each iteration i of the algorithm (while loop in lines 2 to 7), a subgraph Hi is created; the algorithm outputs the subgraph H with the maximum value of Des(H ). Predictions correspond to the missing edges in H . For example, consider the subgraph in Figure 1(c), this is a dense graph produced by Algorithm 1 for the Ion Channel dataset; seven red-dashed lines represent missing edges in this subgraph and correspond to the interactions predicted by our approach. It is important to highlight that four out of the seven predicted interactions could be validated in STITCH with prediction score greater than 0.16; only two of these interactions were among the top-10 predictions of semEP.

Finally, for a bipartite graph G = (T; D; E), Khuller’s algorithm requires O(jEj + jT [ Dj) time, while the time complexity of Algorithm 1 is O(jT j2 + jDj2). The increased complexity is a consequence of the need to compute the single-node similarity values. i.e., after each removal the sets of drugs and targets changes, thus, the singlenode similarity values of the nodes in the graph need to be updated. 3

Related Work

We briefly describe existing approaches for graph data mining and link prediction. Graph mining approaches focus on the problem of detecting patterns in graphs by conducting clustering and ranking methods on knowledge graphs [ 6, 9, 12 ]. RankClus [ 12 ], GNetMine [ 6 ], and JointCluster [ 9 ] apply clustering techniques to place multityped concepts in the same cluster. RankClus exploits link analysis-based ranking with clustering to assign highly ranked entities to highly ranked clusters. GNetMine applies learning based methods on different graph properties, e.g., graph topology, type attributes, or edge labels, to classify and partition nodes in a graph. Finally, JointCluster [ 9 ] is a simultaneous clustering such that nodes within each cluster in the partition are highly connected, and the number of inter-cluster edges is minimized. These clustering methods have shown to be effective, but they cluster nodes instead of edges. Further, they mainly focus on the topology of the graph and do not exploit any semantics, constraints, or similarity information between the nodes to enhance the quality of the clustering. Dissimilar nodes may be placed in a cluster, reducing thus the chances to predict edges whenever the prediction hypothesis is evaluated.

Several approaches have been proposed for link prediction, and particularly, to suggest annotations for genes [ 11, 13 ] and interactions between drugs and targets [ 1, 4, 5, 10, 14, 15 ]. Techniques to identify dense subgraphs [ 11 ] and graph summarization [ 13 ] have been extended to predict Gene Ontology annotations of genes that comprise a clique or from super nodes in a summary of a graph, respectively. These approaches have shown to successfully analyze patterns in graphs. However, because they do not consider semantics encoded in the graph or similarity between the graph concepts, they may be grouping dissimilar concepts in a dense graph or a summary of a graph. Thus, applying the prediction hypothesis graphs produced by these approaches may lead to a small dataset of predictions or just to produce no predictions at all.

Machine learning methods [ 1, 4, 5, 14, 15 ] have been exploited to identify the portions of the knowledge graph where the application of the prediction hypothesis can lead to the discovery of drug-target interactions; a detailed evaluation of all these approaches can be found at [ 3 ]. Additionally, semEP [ 10 ] implements an unsupervised method able to partition edges that represent drug-target interactions; the prediction hypothesis is applied to each component of the partition to predict new interactions. Components of a partition maximize the graph density and pair-wise similarity of the drugs and the targets. Although all these approaches are able to suggest drug-target interactions that can be validated, they may not exhibit the same good performance in graphs as the one illustrated in Figure 1(a). We hypothesize that because esDSG reduces the prediction search to the space of edge-similarity dense graphs, the overlap between the esDSG predictions and the ones suggested by the other approaches will be small; further, esDSG may be able to exhibit better performance from this type of graphs. 4

Evaluation of esDSG and State-of-the-Art Methods

Dataset: We use the dataset created by Bleakley et al. [ 1 ] to evaluate esDSG quality. This dataset is used in the comparison of existing interaction prediction methods [ 3, 10 ]. The dataset comprises 900 drugs, 1,000 targets, and 5,000 interactions from KEGG BRITE3, BRENDA4, SuperTarget5, and DrugBank6. The dataset provides a drug-drug chemical similarity score based on the hashed fingerprints from the SMILES resource, and a target-target similarity score based on the normalized Smith-Waterman sequence similarity score. Data is divided into four groups: Nuclear receptors (NR), Gproteincoupled receptors (GPCR), Ion channels (IC) and Enzymes (E); details in Table 1. State-of-the-art Methods: We use semEP [ 10 ] and the following machine learning methods reported in [ 3 ] to evaluate the quality of the esDSG predictions: i) BLM:

3 http://www.genome.jp/kegg/brite.html 4 http://www.brenda-enzymes.de/ 5 http://bioinf-apache.charite.de/supertarget_v2/ 6 http://www.drugbank.ca/

Bipartite Local Method [ 2 ]; ii) LapRLS: Laplacian Regularized Least Squares [ 15 ]; iii) GIP: Gaussian Interaction Profile [ 14 ]; iv) KBMF2K: Kernelized Bayesian Matrix Factorization with twin Kernels [ 5 ]; and v) NBI: Network-Based Inference [ 2 ]. Additionally, Khuller & Barna’s algorithm is used as baseline. 4.1

Dense Subgraphs and Edge-Similarity Dense Subgraphs

We compare the dense subgraphs, DSG and esDGS, produced by Khuller & Barna’s algorithm and our approach, respectively. Table 2 describes these subgraphs in terms of number of drugs, targets, interactions, density, and average similarities. We can observe that DSGs are larger than esDSGs and also have greater values of the standard density. For datasets NR, IC, and E, the esDGS graphs are denser in terms of edgesimilarity density, and result on less interactions but all predicted from similar drugs or targets. However, although the DSG approximation algorithm does not maximize edge-similarity density, DSG has higher values of this measure than esDSG on GPCR. Further, DSG and esDSG have 24 common predicted interactions. This suggests that GPCR regions that are dense or similar are the same, i.e., the search spaces of the approximation algorithms for DSG and esDSG overlap. 4.2

Validation using STITCH and Kegg

We evaluate the quality of esDSG predictions in terms of the number of interactions that can be validated. We use the latest online version of STITCH [ 8 ] and Kegg to validate each drug-target interaction predicted by esDSG. Table 3(a) reports on both the number of esDSG validated drug-target interactions, and the total number of predicted interactions. We can observe that esDSG predicted 12 out of 30 predictions that could be validated in STITCH and Kegg; while the 57% of the esDSG predicted interactions in the Ion Channel dataset can be also validated. As observed in Table 2, the IC esDSG is the one with the highest values of edge-similarity density, and average target and drug similarities. Furthermore, Table 3(b) reports on the number of validated interactions out of the top-5 novel predicted interactions of all methods; novel interactions are not in the original datasets and have the highest prediction score; these scores are computed by each prediction method. It is clear that esDSG outperforms existing approaches in the Ion Channel dataset. This supports our hypothesis that novel interactions can be predicted from highly dense graphs composed of similar drugs and targets. Effectiveness of esDSG is also measured as the overlap between the esDSG predicted interactions and the top-10 interactions predicted by semEP [ 10 ], and the machine learning methods: BLM, NBI, GIP, LapRLS, and KBMF2K [ 3 ]. Top-10 predictions correspond to the interactions with the highest prediction score. Table 4 compares esDSG and the corresponding prediction method in terms of both: i) the number of interactions that both methods predicted, labelled as Equal, and ii) the number of interactions predicted by esDSG and that are not part of the top-10 predictions of the corresponding method, labelled as Diff. We can observe that esDSG predicts interactions that are not predicted by any of the other methods. These results suggest that the edge-similarity subgraphs from where esDSG generates the predicted interactions, correspond to portions of the search space that none of the other approaches is able to explore. 5

Conclusions

We defined the edge-similarity densest subgraph problem to predict drug-target interactions. We extended Kuller & Barna’s algorithm with the greedy decision of removing the nodes that contribute less to the edge-similarity density values. Experimental results suggest our approach is able to predict novel drug-target interactions from highly dense subgraphs composed of both similar drugs and targets. We plan to extend esDSG to traverse the different connected components and identify esDSGs from each one.

Bleakley and

Yamanishi . Supervised prediction of drug-target interactions using bipartite local models . Bioinformatics , 25 ( 18 ): 2397 - 2403 , 2009 .

2. F. Cheng, C. Liu,

Jiang ,

Lu ,

Li , G. Liu,

Zhou ,

Huang , and

Tang . Prediction of drug-target interactions and drug repositioning via network-based inference . PLoS computational biology , 8 ( 5 ):e1002503, 2012 .

Ding , I. Takigawa,

Mamitsuka , and

Zhu . Similarity-based machine learning methods for predicting drug-target interactions: A brief review . Briefings in Bioinformatics , 2013 .

Fakhraei ,

Huang ,

Raschid , and

Getoor . Network-based drug-target interaction prediction with probabilistic soft logic . IEEE/ACM Transactions on Computational Biology and Bioinformatics , 2014 .

Go ¨nen. Predicting drug-target interactions from chemical and genomic kernels using bayesian matrix factorization . Bioinformatics , 28 ( 18 ): 2304 - 2310 , 2012 .

Ji ,

Sun ,

Danilevsky , J. Han, and

Gao . Graph regularized transductive classification on heterogeneous information networks . In Machine Learning and Knowledge Discovery in Databases , pages 570 - 586 . Springer, 2010 .

Khuller and

Saha . On finding dense subgraphs . In Automata, Languages and Programming , pages 597 - 608 . Springer, 2009 .

Kuhn ,

Szklarczyk ,

Franceschini , C. von Mering ,

L. J.

Jensen , and

Bork . Stitch 3: zooming in on protein-chemical interactions . Nucleic acids research , 40 ( D1 ), 2012 .

Narayanan ,

Vetta ,

E. E.

Schadt , and

Zhu . Simultaneous clustering of multiple gene expression and physical interaction datasets . PLoS Computational Biology , 6 ( 4 ), 2010 .

10. G. Palma,

Vidal , and

Raschid . Drug-target interaction prediction using semantic similarity and edge partitioning . In ISWC , 2014 .

11.

Saha ,

Hoch ,

Khuller ,

Raschid , and X. Zhang. Dense subgraphs with restrictions and applications to gene annotation graphs . In RECOMB , 2010 .

12.

Sun , J. Han,

Zhao ,

Yin , H. Cheng, and

Wu . Rankclus: integrating clustering with ranking for heterogeneous information network analysis . In EDBT , 2009 .

13.

Thor ,

Anderson ,

Raschid ,

Navlakha ,

Saha ,

Khuller , and X. Zhang. Link prediction for annotation graphs using graph summarization . In ISWC , 2011 .

14. T. van Laarhoven,

S. B.

Nabuurs , and

Marchiori . Gaussian interaction profile kernels for predicting drug-target interaction . Bioinformatics , 27 ( 21 ), 2011 .

15.

Xia , L. -Y. Wu , X.

Zhou , and S. T.

Wong . Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces . BMC systems biology, 4(Suppl 2 ): S6 , 2010 .