=Paper=
{{Paper
|id=Vol-2646/37-paper
|storemode=property
|title=Prediction of New Associations between ncRNAs and Diseases Exploiting Multi-Type Hierarchical Clustering
|pdfUrl=https://ceur-ws.org/Vol-2646/37-paper.pdf
|volume=Vol-2646
|authors=Emanuele Pio Barracchia,Gianvito Pio,Domenica D'Elia,Michelangelo Ceci
|dblpUrl=https://dblp.org/rec/conf/sebd/BarracchiaPDC20
}}
==Prediction of New Associations between ncRNAs and Diseases Exploiting Multi-Type Hierarchical Clustering==
Prediction of New Associations between ncRNAs and Diseases Exploiting Multi-Type Hierarchical Clustering (Discussion Paper) Emanuele Pio Barracchia1,3 , Gianvito Pio1,3 , Domenica D’Elia2 , and Michelangelo Ceci1,3,4 1 Dept. of Computer Science - University of Bari Aldo Moro, Bari (Italy) {emanuele.barracchia, gianvito.pio, michelangelo.ceci}@uniba.it 2 CNR, Institute for Biomedical Technologies - Bari (Italy) domenica.delia@ba.itb.cnr.it 3 Big Data Laboratory, CINI Consortium - Rome (Italy) 4 Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana (Slovenia) Abstract. The study of functional associations between ncRNAs and human diseases is a pivotal task of modern research to develop new and more effective therapeutic approaches. Nevertheless, it is not a triv- ial task since it involves entities of different types, such as microRNAs, lncRNAs or target genes. Such a complexity can be faced by representing the involved biological entities and their relationships as a network and by exploiting network-based computational approaches able to identify new associations. However, existing methods are limited to homogeneous networks or can exploit only a limited set of the features of biological enti- ties. To overcome the limitations of existing approaches, we proposed the system LP-HCLUS, which analyzes heterogeneous networks consisting of several types of objects and relationships, each possibly described by a set of features, and extracts hierarchically organized, possibly overlap- ping, multi-type clusters that are subsequently exploited to predict new ncRNA-disease associations. Our experimental evaluation shows that, according to both quantitative (i.e., TPR@k, ROC and PR curves) and qualitative criteria, LP-HCLUS produces better results. Keywords: non-coding RNA (ncRNAs) · diseases · cancer · heteroge- neous network · clustering · link prediction 1 Introduction High-throughput sequencing technologies and recent, more efficient computa- tional approaches, have been fundamental for the rapid advances in functional genomics. Among the most relevant results, there is the discovery of thousands of non-coding RNAs (ncRNAs) with a regulatory function on gene expression. In parallel, the number of studies reporting the involvement of ncRNAs in the development of many different human diseases has grown exponentially. The Copyright c 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy. first type of ncRNAs that has been discovered and largely studied is that of mi- croRNAs (miRNAs), classified as small non-coding RNAs in contrast with long non-coding RNAs (lncRNAs), that are ncRNAs longer than 200nt. While miR- NAs primarily act as post-transcriptional regulators, lncRNAs have a plethora of regulatory functions [10]. However, the number of lncRNAs for which the func- tional and molecular mechanisms are completely elucidated is still quite poor and experimental investigations are still too much expensive for being carried out without any computational pre-analysis. In the last few years, there have been several attempts to computationally predict the relationships among bi- ological entities, such as genes, miRNAs, lncRNAs, diseases [1,11,13,15]. Such methods are based on a network representation of the entities under study and on the identification of new links among nodes in the network. However, most of them are able to work only on homogeneous networks (where nodes and links are of one single type) [5], are strongly limited by the number of different node types or are constrained by pre-defined network structures. In this discussion paper, we describe the method LP-HCLUS [2], that is able to overcome these limitations. In particular, it can discover new ncRNA-disease relationships from heterogeneous attributed networks (i.e., consisting of differ- ent biological entities related by different types of relationships) with arbitrary structure. This ability allows LP-HCLUS to investigate the interactions among different types of entities, possibly leading to increased prediction accuracy. LP-HCLUS exploits a combined approach based on hierarchical, multi-type clustering and link prediction. As we will detail in the next section, a multi-type cluster is actually a heterogeneous sub-network. Therefore, the adoption of a clustering-based approach allows LP-HCLUS to base its predictions on relevant, highly-cohesive heterogeneous sub-networks. Moreover, the hierarchical organi- zation of clusters allows it to perform predictions at different levels of granularity, taking into account either local/specific or global/general relationships. 2 Method In the following, we introduce the notation and some useful definitions. Definition 1 (Heterogeneous attributed network). A heterogeneous at- tributed network is a network G = (V, E), where V is the set of nodes and E is the set of edges, and both nodes and edges can be of different types. Moreover: – T = Tt ∪ Ttr is the set of node types, where Tt is the set of target types, i.e. considered as target of the clustering/prediction task, and Ttr is the set of task-relevant types. Only nodes of target types are clustered and considered in the identification of new relationships. – Each node type Tv ∈ T defines a subset of nodes in the network, i.e., Vv ⊆ V . – Each node type Tv ∈ T is associated with a set of attributes Av = {Av,1 , Av,2 , ..., Av,mv }, i.e., nodes of the type Tv are described by the attributes Av . – R is the set of all the possible edge types. – Each edge type Rl ∈ R defines a subset of edges El ⊆ E. C1 C1, C2 C2 C1, C2, C3 C3 C1, C2, C3, C4, C5 C4 C4, C5 C5 (a) (b) Fig. 1. A hierarchy of overlapping multi-type clusters: (a) emphasizes the overlapping among multi-type clusters; (b) shows their hierarchical organization. Definition 2 (Hierarchical multi-type clustering). A hierarchy of multi- type clusters is defined as a list of hierarchy levels [L1 , L2 , . . . , Lk ], where each Li consists of a set of overlapping multi-type clusters. For each level Li , i = 2, 3, .. . . . k, ∀ G0 ∈ Li ∃ G00 ∈ Li−1 , such that G00 is a subnetwork of G0 (Fig. 1). According to these definitions, we define the task considered in this work. Definition 3 (Predictive hierarchical clustering for link prediction). Given a heterogeneous attributed network G = (V, E) and the set of target types Tt , the goal is to find: – A hierarchy of overlapping multi-type clusters [L1 , L2 , . . . , Lk ]. – A function ψ (w) : Vi1 ×Vi2 →[0, 1] for each hierarchical level Lw (w ∈ 1, 2, ..., k), where nodes in Vi1 are of type Ti1 ∈ Tt and nodes in Vi2 are of type Ti2 ∈ Tt . Each function ψ (w) maps each possible pair of nodes (of types Ti1 and Ti2 ) to a score representing the degree of certainty of their relationship. In this paper LP-HCLUS has been used to solve the task formalized in Definition 3, by considering ncRNAs and diseases as target types. Hence, we determine two distinct set of nodes denoted by Tn and Td , representing the set of ncRNAs and the set of diseases, respectively. In the following subsections, we will describe the main steps executed by LP-HCLUS (see Fig. 2 for a general overview). 2.1 Estimation of the strength of the relationship In the first phase, we estimate the strength of the relationship among all the possible ncRNA-disease pairs in the network G. In particular, we aim to com- pute a score s(ni , dj ) for each possible pair ni , dj , by exploiting the concept of meta-path. According to [14], a meta-path is a set of sequences of nodes (in- volving both target and task-relevant types) which follow the same sequence of edge types, and can be used to fruitfully represent conceptual (possibly indi- rect) relationships between two entities in a heterogeneous network. Given the ncRNA ni and the disease dj , the relationship between them can be consid- ered “certain” if there is at least one meta-path which confirms its certainty. Extracted edges w1 Estimation of the Construction of the strength of w2 hierarchy of clusters relationships ... w|Ê| Predicted relationships Hierarchy of clusters s(n1, d1) d1 n1 s(n1, d3) Prediction d3 n1 ... s(nk, di) di nk Fig. 2. General workflow of the method LP-HCLUS. Therefore, by assimilating the score associated with an interaction to its degree of certainty, we compute s(ni , dj ) as the maximum value observed over all the possible meta-paths between ni and dj . Formally: s(ni , dj ) = max pathscore(P, ni , dj ) (1) P ∈metapaths(ni ,dj ) where metapaths(ni , dj ) is the set of meta-paths connecting ni and dj , and pathscore(P, ni , dj ) is the degree of certainty of the relationship between ni and dj according to the meta-path P . In order to compute pathscore(P, ni , dj ), we represent each meta-path P as a finite set of sequences of nodes. If a sequence in P connects ni and dj , then pathscore(P, ni , dj ) = 1. Otherwise, following the same strategy introduced before, it is computed as the maximum similarity between the sequences which start with ni and the sequences which end with dj (see Fig. 3). The intuition behind this formula is that if ni and dj are not directly connected, their score represents the similarity of the nodes and edges they are connected to. The similarity between two sequences seq 0 and seq 00 is computed according to the the attributes of all nodes involved in the two sequences: fol- 0 )−valx (seq 00 )| lowing [6], if x is numeric, then sx (seq 0 , seq 00 ) = 1 − |valx (seq maxx −minx , where minx (resp. maxx ) is the minimum (resp. maximum) value, for the attribute x; if x is not a numeric attribute, sx (seq 0 , seq 00 ) = 1 if valx (seq 0 ) = valx (seq 00 ), 0 otherwise. In this solution there could be some node types that are not involved in any meta-path. In order to exploit the information conveyed by these nodes, we add an aggregation of their attribute values (the arithmetic mean for nu- merical attributes, the mode for non-numerical attributes) to the nodes that are connected to them and that appear in at least one meta-path. 2.2 Construction of a hierarchy of overlapping multi-type clusters We construct the first level of the hierarchy by identifying a set of overlapping multi-type clusters in the form of bicliques. To this aim, we perform three steps: i) Filtering, which keeps only the ncRNA-disease pairs with a score greater than (or equal to) β. The result of this step is the subset {(ni , dj )|s(ni , dj ) ≥ β} Attributes of other entities ncRNA Attributes Disease Attributes in the path Seq n. 1 2 3 4 5 6 7 8 9 Fig. 3. Sequences between the ncRNA “h19” and the disease “asthma” according to a meta-path. Sequences emphasized in yellow (1 and 9) are those starting with “h19”, while sequences emphasized in blue (4, 5, 6 and 7) are those ending with “asthma”. ii) Initialization, which builds the initial set of clusters in the form of bicliques, each consisting of a ncRNA-disease pair in {(ni , dj )|s(ni , dj ) ≥ β}. iii) Merging, which iteratively merges two clusters C 0 and C 00 into a new cluster C 000 . This step regards the initial set of clusters as a list sorted according to an ordering relationh(C 00 ). The approach adopted to build the other hierarchical levels is similar to the merging step performed to obtain L1 . The main difference is that we do not obtain bicliques, but generic multi-type clusters. Since the biclique constraint is removed, we need another stopping criterion for the iterative merging procedure. Coherently with approaches used in hierarchical co-clustering and following [12], we adopt a user-defined threshold α on the cohesiveness of the obtained clusters. In particular, two clusters C 0 and C 00 can be merged into a new cluster C 000 if h(C 000 ) > α, where h(C 000 ) is the cluster cohesiveness. This means that α defines the minimum cluster cohesiveness that must be satisfied by a cluster obtained after a merging. The iterative process stops when it is not possible to merge more clusters with a minimum level of cohesiveness α. 2.3 Prediction of new ncRNA-disease relationships In the last phase, we exploit each level of the identified hierarchy of multi-type clusters as a prediction model. In particular, we compute, for each ncRNA- disease pair, a score representing its degree of certainty on the basis of the w multi-type clusters containing it. Formally, let Cij be a cluster identified in the w-th hierarchical level in which the ncRNA ni and the disease dj appear. We compute the degree of certainty of the relationship between ni and dj as: Fig. 4. TPR@k, Precision-Recall and ROC curves results for the dataset HMDD v3.0, obtained with the best configuration (α = 0.2, β = 0.4, level = 2). ψ (w) (ni , dj ) = h Cijw , that is, we compute the degree of certainty of the new interaction as the average degree of certainty of the known relationships in the cluster. In some cases, the same interaction may appear in multiple clusters, since the proposed algorithm is able to identify overlapping clusters. In this w case, Cij represents the list of multi-type clusters in which both ni and dj appear and we aggregate their cohesiveness values according to four different strategies: maximum, minimum, average and evidence combination [9]. 3 Experiments LP-HCLUS has been run with different values of its input parameters. In par- ticular, following the results obtained in [12], we considered α ∈ {0.1, 0.2} and β ∈ {0.3, 0.4}. The considered datasets are: i) HMDD v3.0 which stores 985 miRNAs, 675 diseases and 20,859 relationships between diseases and miRNAs; ii) Integrated Dataset (ID), built by integrating multiple datasets [3,4,7,8], composed by 7,049 diseases, 70 lncRNA-miRNA relationships, 3,830 relation- ships between diseases and ncRNAs, 90,242 target genes, 26,522 disease-target associations and 1,055 ncRNA-target relationships. We compared LP-HCLUS with the following competitors: i) HOCCLUS2 [12], a biclustering algorithm that, similarly to LP-HCLUS, identifies a hierarchy of (possibly overlapping) heterogeneous clusters. It is, how- ever, limited to work with only two types of objects. Since its parameters have a similar meaning with respect to LP-HCLUS parameters, we evaluated its results with the same setting, i.e., α ∈ {0.1, 0.2} and β ∈ {0.3, 0.4}; ii) ncPred [1], a system that was specifically designed to predict new ncRNA- disease associations. ncPred cannot catch information coming from other entities in the network and it is not able to exploit features associated to nodes and links. iii) LP-HCLUS-NoLP, which corresponds to a baseline version of system LP- HCLUS, without the clustering and the link prediction steps. In particular, we consider the score obtained in the first phase of LP-HCLUS (see Section 2.1) as the final score associated with each interaction. We adopted the 10-fold cross validation on the set of known ncRNA-disease relationships and, due to absence of negative samples, we evaluated the results in Fig. 5. TPR@k, Precision-Recall and ROC curves results for the dataset ID, obtained with the best configuration (α = 0.1, β = 0.4, level = 1). terms of TruePositiveRate@k curve. Moreover, we also report the results in terms of ROC and Precision-Recall curves by considering the unknown relationships as negative examples. We remark that ROC and PR curves can only be used for relative comparison and not as absolute evaluation measures because they are spoiled by the assumption made on unknown relationships. In Figs. 4 and 5 we show some results obtained with the most promising con- figurations. From the quantitative viewpoint, we can observe that the proposed method LP-HCLUS, with the combination strategy based on the maximum, is able to obtain the best performances, for all the considered measures. From a qualitative point of view, we first performed a comparative analysis between the results obtained by LP-HCLUS against the validated interactions reported in the updated version of HMDD (i.e., v3.2 released on March 27th, 2019). We found 3,055 LP-HCLUS predictions confirmed by the new release of HMDD at the hierarchy level 1, 4,119 at level 2 and 4,797 at level 3. Next, we conducted a qualitative analysis of the top-ranked relationships predicted by LP-HCLUS us- ing ID dataset, selecting only those with a score equal to 1.0. For this purpose, we exploited MNDR v2.0, which is a comprehensive resource including more than 260,000 experimental and predicted ncRNA-disease associations for mammalian species. Also in this case, we found some associations in both MNDR and in the list of predicted associations by LP-HCLUS. A more comprehensive analysis, reporting several additional examples, can be found in the full paper [2]. 4 Conclusions In this paper, we have tackled the problem of predicting possibly unknown ncRNA-disease relationships. The proposed approach LP-HCLUS is able to take advantage from the possible heterogeneous nature of the attributed biological network analyzed. The results confirm the initial intuitions and show compet- itive performances of LP-HCLUS in terms of accuracy of the predictions, also when compared with state-of-the-art competitor systems. These results are also supported by a comparison of LP-HCLUS predictions with data reported in MNDR and by a qualitative analysis that revealed that several ncRNA-disease associations predicted by LP-HCLUS have been subsequently experimentally validated and introduced in a more recent release (v3.2) of HMDD. As future work, we will evaluate the performance of LP-HCLUS in other domains. 5 Acknowledgments We acknowledge the support of Ministry of Education, Universities and Research (MIUR) through the PON project TALIsMAn - Tecnologie di Assistenza per- sonALizzata per il Miglioramento della quAlità della vitA (ARS01 01116). Dr. Gianvito Pio acknowledges the support of Ministry of Education, Universities and Research (MIUR) through the project AIM1852414, activity 1, line 1. References 1. Alaimo, S., Giugno, R., Pulvirenti, A.: ncPred: ncRNA-Disease Association Pre- diction through Tripartite Network-Based Inference. Frontiers in Bioengineering and Biotechnology 2 (Dec 2014) 2. Barracchia, E.P., Pio, G., D’Elia, D., Ceci, M.: Prediction of new associations between ncrnas and diseases exploiting multi-type hierarchical clustering. BMC bioinformatics 21(1), 1–24 (2020) 3. Bauer-Mehren, A., Rautschka, M., Sanz, F., Furlong, L.I.: DisGeNET: a Cytoscape plugin to visualize, integrate, search and analyze gene-disease networks. Bioinfor- matics (Oxford, England) 26(22), 2924–2926 (Nov 2010) 4. Chen, G., Wang, Z., Wang, D., Qiu, C., Liu, M., Chen, X., Zhang, Q., Yan, G., Cui, Q.: LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Research 41(Database issue) (Jan 2013) 5. Chen, X., Yan, C.C., Luo, C., Ji, W., Zhang, Y., Dai, Q.: Constructing lncRNA functional similarity network based on lncRNA-disease associations and disease semantic similarity. Scientific Reports 5 (Jun 2015) 6. Han, J., Kamber, M.: Data mining: concepts and techniques. Elsevier/Morgan Kaufmann, Amsterdam (2006) 7. Helwak, A., Kudla, G., et al.: Mapping the human miRNA interactome by CLASH reveals frequent noncanonical binding. Cell 153(3), 654–665 (2013) 8. Jiang, Q., Wang, Y., Hao, Y., Juan, L., Teng, M., Zhang, X., Li, M., Wang, G., Liu, Y.: miR2disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Research 37(Database issue), D98–104 (Jan 2009) 9. Lesmo, L., Saitta, L., Torasso, P.: Evidence combination in expert systems. Inter- national Journal of Man-Machine Studies 22(3), 307–326 (Mar 1985) 10. Melissari, M.T., Grote, P.: Roles for long non-coding RNAs in physiology and disease. Pflügers Archiv - European Journal of Physiology 468(6), 945–958 (2016) 11. Mignone, P., Pio, G., D’Elia, D., Ceci, M.: Exploiting transfer learning for the reconstruction of the human gene regulatory network. Bioinform. 36(5), 1553–1561 (2020) 12. Pio, G., Ceci, M., D’Elia, D., Loglisci, C., Malerba, D.: A Novel Biclustering Algo- rithm for the Discovery of Meaningful Biological Correlations between microRNAs and their Target Genes. BMC Bioinformatics 14(Suppl 7), S8 (Apr 2013) 13. Pio, G., Ceci, M., Prisciandaro, F., Malerba, D.: Exploiting causality in gene net- work reconstruction based on graph embedding. Machine Learning (2019) 14. Pio, G., Serafino, F., Malerba, D., Ceci, M.: Multi-type clustering and classification from heterogeneous networks. Information Sciences 425, 107–126 (Jan 2018) 15. Wang, P., Guo, Q., et al.: Improved method for prioritization of disease associated lncRNAs based on ceRNA theory and functional genomics data. Oncotarget 8(3), 4642–4655 (Dec 2016)