
Prediction of New Associations between ncRNAs and Diseases Exploiting Multi-Type Hierarchical Clustering (Discussion Paper)

Emanuele Pio Barracchia (0, 2), Gianvito Pio (0, 2), Domenica D'Elia (1), Michelangelo Ceci (0, 2, 3)

domenica.delia@ba.itb.cnr.it, michelangelo.ceci@uniba.it

0 Big Data Laboratory, CINI Consortium - Rome, Italy
1 CNR, Institute for Biomedical Technologies - Bari, Italy
2 Dept. of Computer Science - University of Bari Aldo Moro, Bari, Italy
3 Dept. of Knowledge Technologies, Jozef Stefan Institute, Ljubljana, Slovenia

The study of functional associations between ncRNAs and human diseases is a pivotal task of modern research to develop new and more effective therapeutic approaches. Nevertheless, it is not a trivial task, since it involves entities of different types, such as microRNAs, lncRNAs or target genes. Such complexity can be faced by representing the involved biological entities and their relationships as a network and by exploiting network-based computational approaches able to identify new associations. However, existing methods are limited to homogeneous networks or can exploit only a limited set of the features of biological entities. To overcome the limitations of existing approaches, we proposed the system LP-HCLUS, which analyzes heterogeneous networks consisting of several types of objects and relationships, each possibly described by a set of features, and extracts hierarchically organized, possibly overlapping, multi-type clusters that are subsequently exploited to predict new ncRNA-disease associations. Our experimental evaluation shows that, according to both quantitative (i.e., TPR@k, ROC and PR curves) and qualitative criteria, LP-HCLUS produces better results than the competitor systems.

Keywords: non-coding RNA (ncRNAs), diseases, cancer, heterogeneous network, clustering, link prediction

1 Introduction

High-throughput sequencing technologies and recent, more efficient computational approaches have been fundamental for the rapid advances in functional genomics. Among the most relevant results, there is the discovery of thousands of non-coding RNAs (ncRNAs) with a regulatory function on gene expression.

In parallel, the number of studies reporting the involvement of ncRNAs in the development of many different human diseases has grown exponentially. The first type of ncRNAs that has been discovered and largely studied is that of microRNAs (miRNAs), classified as small non-coding RNAs, in contrast with long non-coding RNAs (lncRNAs), which are ncRNAs longer than 200 nt. While miRNAs primarily act as post-transcriptional regulators, lncRNAs have a plethora of regulatory functions [10]. However, the number of lncRNAs for which the functional and molecular mechanisms are completely elucidated is still quite small, and experimental investigations are still too expensive to be carried out without any computational pre-analysis. In the last few years, there have been several attempts to computationally predict the relationships among biological entities, such as genes, miRNAs, lncRNAs and diseases [1,11,13,15]. Such methods are based on a network representation of the entities under study and on the identification of new links among nodes in the network. However, most of them are able to work only on homogeneous networks (where nodes and links are of one single type) [5], are strongly limited by the number of different node types or are constrained by pre-defined network structures.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy.

In this discussion paper, we describe the method LP-HCLUS [2], which is able to overcome these limitations. In particular, it can discover new ncRNA-disease relationships from heterogeneous attributed networks (i.e., consisting of different biological entities related by different types of relationships) with arbitrary structure. This ability allows LP-HCLUS to investigate the interactions among different types of entities, possibly leading to increased prediction accuracy.

LP-HCLUS exploits a combined approach based on hierarchical, multi-type clustering and link prediction. As we will detail in the next section, a multi-type cluster is actually a heterogeneous sub-network. Therefore, the adoption of a clustering-based approach allows LP-HCLUS to base its predictions on relevant, highly cohesive heterogeneous sub-networks. Moreover, the hierarchical organization of clusters allows it to perform predictions at different levels of granularity, taking into account either local/specific or global/general relationships.

2 Method

In the following, we introduce the notation and some useful definitions.

Definition 1 (Heterogeneous attributed network). A heterogeneous attributed network is a network $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of edges, and both nodes and edges can be of different types. Moreover:
- $T = T_t \cup T_{tr}$ is the set of node types, where $T_t$ is the set of target types, i.e., considered as target of the clustering/prediction task, and $T_{tr}$ is the set of task-relevant types. Only nodes of target types are clustered and considered in the identification of new relationships.
- Each node type $T_v \in T$ defines a subset of nodes in the network, i.e., $V_v \subseteq V$.
- Each node type $T_v \in T$ is associated with a set of attributes $A_v = \{A_{v,1}, A_{v,2}, \ldots, A_{v,m_v}\}$, i.e., nodes of the type $T_v$ are described by the attributes $A_v$.
- $R$ is the set of all the possible edge types.
- Each edge type $R_l \in R$ defines a subset of edges $E_l \subseteq E$.

[Fig. 1. Hierarchy of overlapping multi-type clusters, panels (a) and (b); cluster labels C1-C5.]
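To make Definition 1 concrete, the following is a minimal sketch of how such a network could be held in memory. The class name `HeteroNetwork`, its field names and the toy node identifiers (`n1`, `g1`, `d1`) are illustrative assumptions and are not taken from the LP-HCLUS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class HeteroNetwork:
    """Toy container for a heterogeneous attributed network G = (V, E)."""
    target_types: set        # T_t: types that are clustered / predicted
    relevant_types: set      # T_tr: task-relevant types
    node_type: dict = field(default_factory=dict)    # node id -> node type T_v
    node_attrs: dict = field(default_factory=dict)   # node id -> {attribute: value}
    edges: dict = field(default_factory=dict)        # edge type R_l -> set of (u, v)

    def add_node(self, node, ntype, **attrs):
        # Every node must belong to a declared type in T = T_t U T_tr.
        if ntype not in self.target_types | self.relevant_types:
            raise ValueError(f"unknown node type: {ntype}")
        self.node_type[node] = ntype
        self.node_attrs[node] = attrs

    def add_edge(self, etype, u, v):
        if u not in self.node_type or v not in self.node_type:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.setdefault(etype, set()).add((u, v))

    def nodes_of(self, ntype):
        """V_v: the subset of nodes of type T_v."""
        return {n for n, t in self.node_type.items() if t == ntype}


# Hypothetical toy network: one ncRNA, one gene, one disease.
net = HeteroNetwork(target_types={"ncRNA", "disease"}, relevant_types={"gene"})
net.add_node("n1", "ncRNA", length=22)
net.add_node("g1", "gene")
net.add_node("d1", "disease")
net.add_edge("ncRNA-gene", "n1", "g1")
net.add_edge("gene-disease", "g1", "d1")
print(net.nodes_of("ncRNA"))  # {'n1'}
```

Keeping one edge set per edge type mirrors the subsets $E_l \subseteq E$ of the definition and makes it cheap to follow a fixed sequence of edge types.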

Definition 2 (Hierarchical multi-type clustering). A hierarchy of multi-type clusters is defined as a list of hierarchy levels $[L_1, L_2, \ldots, L_k]$, where each $L_i$ consists of a set of overlapping multi-type clusters. For each level $L_i$, $i = 2, 3, \ldots, k$, $\forall\, G' \in L_i \;\exists\, G'' \in L_{i-1}$ such that $G''$ is a subnetwork of $G'$ (Fig. 1).

According to these definitions, we define the task considered in this work.

Definition 3 (Predictive hierarchical clustering for link prediction). Given a heterogeneous attributed network $G = (V, E)$ and the set of target types $T_t$, the goal is to find:
- A hierarchy of overlapping multi-type clusters $[L_1, L_2, \ldots, L_k]$.
- A function $\psi^{(w)}: V_{i_1} \times V_{i_2} \to [0, 1]$ for each hierarchical level $L_w$ ($w \in \{1, 2, \ldots, k\}$), where nodes in $V_{i_1}$ are of type $T_{i_1} \in T_t$ and nodes in $V_{i_2}$ are of type $T_{i_2} \in T_t$. Each function $\psi^{(w)}$ maps each possible pair of nodes (of types $T_{i_1}$ and $T_{i_2}$) to a score representing the degree of certainty of their relationship.

In this paper, LP-HCLUS has been used to solve the task formalized in Definition 3, by considering ncRNAs and diseases as target types. Hence, we determine two distinct sets of nodes, denoted by $T_n$ and $T_d$, representing the set of ncRNAs and the set of diseases, respectively. In the following subsections, we describe the main steps executed by LP-HCLUS (see Fig. 2 for a general overview).
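The nesting condition of Definition 2 can be checked mechanically. The sketch below assumes, as a simplification, that a cluster is represented only by its set of node identifiers, so that "subnetwork" reduces to set inclusion; the function name `is_valid_hierarchy` is hypothetical.

```python
def is_valid_hierarchy(levels):
    """Check the nesting condition of a hierarchy of clusters.

    levels[0] plays the role of L_1, levels[1] of L_2, and so on.
    Each level is a list of clusters, and each cluster is a set of node ids.
    Every cluster at level L_i (i >= 2) must contain at least one cluster
    of level L_{i-1}.
    """
    for lower, upper in zip(levels, levels[1:]):
        for big in upper:
            if not any(small <= big for small in lower):
                return False
    return True


# Toy hierarchy with overlapping clusters at the first level (d1 is shared).
L1 = [{"n1", "d1"}, {"n2", "d1"}, {"n3", "d2"}]
L2 = [{"n1", "n2", "d1"}, {"n3", "d2"}]
L3 = [{"n1", "n2", "n3", "d1", "d2"}]

print(is_valid_hierarchy([L1, L2, L3]))          # True
print(is_valid_hierarchy([L1, [{"n9", "d9"}]]))  # False
```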

2.1 Estimation of the strength of the relationship

In the first phase, we estimate the strength of the relationship among all the possible ncRNA-disease pairs in the network $G$. In particular, we aim to compute a score $s(n_i, d_j)$ for each possible pair $n_i, d_j$, by exploiting the concept of meta-path. According to [14], a meta-path is a set of sequences of nodes (involving both target and task-relevant types) which follow the same sequence of edge types, and can be used to fruitfully represent conceptual (possibly indirect) relationships between two entities in a heterogeneous network. Given the ncRNA $n_i$ and the disease $d_j$, the relationship between them can be considered "certain" if there is at least one meta-path which confirms its certainty.
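As an illustration of the meta-path concept, the sketch below enumerates the node sequences that follow a given sequence of edge types and tests whether one of them links a given ncRNA to a given disease. The edge-type names, the toy edges and the helper names are assumptions made for the example; the similarity-based score used when no sequence links the pair is not covered here.

```python
def metapath_instances(edges, edge_type_seq, start):
    """Enumerate node sequences starting at `start` that follow `edge_type_seq`.

    edges: dict mapping an edge type to a set of directed (u, v) pairs.
    """
    paths = [[start]]
    for etype in edge_type_seq:
        paths = [
            p + [v]
            for p in paths
            for (u, v) in sorted(edges.get(etype, ()))
            if u == p[-1]
        ]
    return paths


def connects(edges, edge_type_seq, n, d):
    """True if some sequence of the meta-path goes from n to d."""
    return any(p[-1] == d for p in metapath_instances(edges, edge_type_seq, n))


# Hypothetical toy network: ncRNAs n*, genes g*, diseases d*.
edges = {
    "ncRNA-gene": {("n1", "g1"), ("n1", "g2"), ("n2", "g2")},
    "gene-disease": {("g1", "d1"), ("g2", "d2")},
}
P = ["ncRNA-gene", "gene-disease"]

print(metapath_instances(edges, P, "n1"))
# [['n1', 'g1', 'd1'], ['n1', 'g2', 'd2']]
print(connects(edges, P, "n2", "d2"))  # True
print(connects(edges, P, "n2", "d1"))  # False
```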

[Fig. 2. Overview of LP-HCLUS. Panels: estimation of the strength of relationships; construction of the hierarchy of clusters; prediction of relationships.]

Therefore, by assimilating the score associated with an interaction to its degree of certainty, we compute $s(n_i, d_j)$ as the maximum value observed over all the possible meta-paths between $n_i$ and $d_j$. Formally:

$$s(n_i, d_j) = \max_{P \in metapaths(n_i, d_j)} pathscore(P, n_i, d_j) \qquad (1)$$

where $metapaths(n_i, d_j)$ is the set of meta-paths connecting $n_i$ and $d_j$, and $pathscore(P, n_i, d_j)$ is the degree of certainty of the relationship between $n_i$ and $d_j$ according to the meta-path $P$.

In order to compute $pathscore(P, n_i, d_j)$, we represent each meta-path $P$ as a finite set of sequences of nodes. If a sequence in $P$ connects $n_i$ and $d_j$, then $pathscore(P, n_i, d_j) = 1$. Otherwise, following the same strategy introduced before, it is computed as the maximum similarity between the sequences which start with $n_i$ and the sequences which end with $d_j$ (see Fig. 3). The intuition behind this formula is that if $n_i$ and $d_j$ are not directly connected, their score represents the similarity of the nodes and edges they are connected to.

The similarity between two sequences $seq'$ and $seq''$ is computed according to the attributes of all nodes involved in the two sequences. Following [6], if $x$ is a numeric attribute, then

$$s_x(seq', seq'') = 1 - \frac{|val_x(seq') - val_x(seq'')|}{max_x - min_x},$$

where $min_x$ (resp. $max_x$) is the minimum (resp. maximum) value for the attribute $x$; if $x$ is not a numeric attribute, $s_x(seq', seq'') = 1$ if $val_x(seq') = val_x(seq'')$, 0 otherwise.

In this solution, there could be some node types that are not involved in any meta-path. In order to exploit the information conveyed by these nodes, we add an aggregation of their attribute values (the arithmetic mean for numerical attributes, the mode for non-numerical attributes) to the nodes that are connected to them and that appear in at least one meta-path.

2.2 Construction of a hierarchy of overlapping multi-type clusters

We construct the first level of the hierarchy by identifying a set of overlapping multi-type clusters in the form of bicliques. To this aim, we perform three steps:
i) Filtering, which keeps only the ncRNA-disease pairs with a score greater than (or equal to) $\alpha$. The result of this step is the subset $\{(n_i, d_j) \mid s(n_i, d_j) \geq \alpha\}$.
ii) Initialization, which builds the initial set of clusters in the form of bicliques, each consisting of an ncRNA-disease pair in $\{(n_i, d_j) \mid s(n_i, d_j) \geq \alpha\}$.
iii) Merging, which iteratively merges two clusters $C'$ and $C''$ into a new cluster $C'''$. This step regards the initial set of clusters as a list sorted according to an ordering relation $<_c$ that reflects the quality of the clusters. Each cluster $C'$ is then merged with the first cluster $C''$ in the list that would lead to a cluster $C'''$ which still satisfies the biclique constraint. This step is repeated until no additional clusters that satisfy the biclique constraint can be obtained.

The ordering relation $<_c$ defines a greedy search strategy that guides the order in which pairs of clusters are analyzed. $<_c$ is based on the cluster cohesiveness $h(C)$, which corresponds to the average score in the cluster, namely:

$$h(C) = \frac{1}{|pairs(C)|} \sum_{(n_i, d_j) \in pairs(C)} s(n_i, d_j),$$

where $pairs(C)$ is the set of all the possible ncRNA-disease pairs that can be constructed from the set of ncRNAs and diseases in the cluster. Accordingly, if $C'$ and $C''$ are two different clusters, the ordering relation $<_c$ is defined as follows: $C' <_c C'' \iff h(C') > h(C'')$.
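The three steps and the ordering relation can be sketched as follows. This is a simplified reading rather than the authors' implementation: a cluster is a pair (set of ncRNAs, set of diseases), the biclique constraint is assumed to mean that every ncRNA-disease pair of the merged cluster survived the filtering step, and the search restarts after every merge. All function names and the toy scores are hypothetical.

```python
from itertools import product


def cohesiveness(cluster, s):
    """h(C): average score over all ncRNA-disease pairs of the cluster."""
    ncrnas, diseases = cluster
    pairs = list(product(ncrnas, diseases))
    return sum(s.get(p, 0.0) for p in pairs) / len(pairs)


def is_biclique(cluster, kept):
    """Assumed constraint: every pair of the cluster passed the filtering."""
    ncrnas, diseases = cluster
    return all(p in kept for p in product(ncrnas, diseases))


def first_level(s, alpha):
    # i) Filtering: keep the pairs whose score is at least alpha.
    kept = {p for p, v in s.items() if v >= alpha}
    # ii) Initialization: one cluster per surviving pair.
    clusters = [(frozenset([n]), frozenset([d])) for n, d in kept]
    # iii) Merging: greedy, guided by decreasing cohesiveness.
    merged = True
    while merged:
        merged = False
        clusters.sort(key=lambda c: (-cohesiveness(c, s), sorted(c[0]), sorted(c[1])))
        for i, a in enumerate(clusters):
            for j in range(i + 1, len(clusters)):
                b = clusters[j]
                c = (a[0] | b[0], a[1] | b[1])
                if is_biclique(c, kept):
                    clusters = [x for k, x in enumerate(clusters) if k not in (i, j)]
                    if c not in clusters:
                        clusters.append(c)
                    merged = True
                    break
            if merged:
                break
    return clusters


# Hypothetical scores s(n, d) for a toy network.
scores = {("n1", "d1"): 0.9, ("n2", "d1"): 0.8, ("n2", "d2"): 0.7, ("n3", "d2"): 0.05}
for ncrnas, diseases in first_level(scores, alpha=0.1):
    print(sorted(ncrnas), sorted(diseases))
# ['n1', 'n2'] ['d1']
# ['n2'] ['d2']
```

In the toy run, `n2` ends up in two clusters, which shows how overlapping clusters arise even at the first level.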

The approach adopted to build the other hierarchical levels is similar to the merging step performed to obtain $L_1$. The main difference is that we do not obtain bicliques, but generic multi-type clusters. Since the biclique constraint is removed, we need another stopping criterion for the iterative merging procedure. Consistently with approaches used in hierarchical co-clustering and following [12], we adopt a user-defined threshold $\beta$ on the cohesiveness of the obtained clusters. In particular, two clusters $C'$ and $C''$ can be merged into a new cluster $C'''$ if $h(C''') > \beta$, where $h(C''')$ is the cluster cohesiveness. This means that $\beta$ defines the minimum cluster cohesiveness that must be satisfied by a cluster obtained after a merging. The iterative process stops when it is not possible to merge more clusters with a minimum level of cohesiveness $\beta$.
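A possible reading of this procedure is sketched below. It assumes that each new level is produced by one merging pass over the previous level, that each cluster takes part in at most one merge per pass, and that pairs without a computed score count as 0; none of these details is stated in the text above, and the function names are hypothetical.

```python
from itertools import product


def cohesiveness(cluster, s):
    """h(C): average score over all ncRNA-disease pairs of the cluster."""
    ncrnas, diseases = cluster
    pairs = list(product(ncrnas, diseases))
    return sum(s.get(p, 0.0) for p in pairs) / len(pairs)


def merge_pass(level, s, beta):
    """Merge each cluster with the first later cluster giving h > beta."""
    ordered = sorted(level, key=lambda c: -cohesiveness(c, s))
    used, out = set(), []
    for i, a in enumerate(ordered):
        if i in used:
            continue
        for j in range(i + 1, len(ordered)):
            if j in used:
                continue
            b = ordered[j]
            c = (a[0] | b[0], a[1] | b[1])
            if cohesiveness(c, s) > beta:
                used.update((i, j))
                out.append(c)
                break
        else:
            out.append(a)  # no admissible partner: the cluster is carried over
    return out


def build_hierarchy(level1, s, beta):
    """Stack levels until a pass cannot merge anything."""
    levels = [level1]
    while True:
        nxt = merge_pass(levels[-1], s, beta)
        if len(nxt) == len(levels[-1]):
            return levels
        levels.append(nxt)


# Hypothetical scores and first-level clusters.
scores = {("n1", "d1"): 0.9, ("n2", "d1"): 0.8, ("n2", "d2"): 0.7, ("n3", "d3"): 0.6}
level1 = [
    (frozenset({"n1", "n2"}), frozenset({"d1"})),
    (frozenset({"n2"}), frozenset({"d2"})),
    (frozenset({"n3"}), frozenset({"d3"})),
]
levels = build_hierarchy(level1, scores, beta=0.4)
print(len(levels))  # 2
```

With $\beta = 0.4$, the first two clusters are merged (cohesiveness 0.6), while adding the third one would lower the cohesiveness to about 0.33, so the process stops after the second level.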

2.3 Prediction of new ncRNA-disease relationships

In the last phase, we exploit each level of the identified hierarchy of multi-type clusters as a prediction model. In particular, we compute, for each ncRNA-disease pair, a score representing its degree of certainty on the basis of the multi-type clusters containing it. Formally, let $C_{ij}^w$ be a cluster identified in the $w$-th hierarchical level in which the ncRNA $n_i$ and the disease $d_j$ appear. We compute the degree of certainty of the relationship between $n_i$ and $d_j$ as:

$$\psi^{(w)}(n_i, d_j) = h(C_{ij}^w),$$

that is, we compute the degree of certainty of the new interaction as the average degree of certainty of the known relationships in the cluster. In some cases, the same interaction may appear in multiple clusters, since the proposed algorithm is able to identify overlapping clusters. In this case, $C_{ij}^w$ represents the list of multi-type clusters in which both $n_i$ and $d_j$ appear and we aggregate their cohesiveness values according to four different strategies: maximum, minimum, average and evidence combination [9].
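The prediction step with the first three aggregation strategies can be sketched as follows. The evidence combination strategy of [9] is deliberately left out, since its formula is not given here. Treating pairs without a computed score as 0, and returning 0 for pairs that share no cluster, are assumptions of this sketch, as are the function names and the toy data.

```python
from itertools import product
from statistics import mean


def cohesiveness(cluster, s):
    """h(C): average score over all ncRNA-disease pairs of the cluster."""
    ncrnas, diseases = cluster
    pairs = list(product(ncrnas, diseases))
    return sum(s.get(p, 0.0) for p in pairs) / len(pairs)


AGGREGATORS = {"max": max, "min": min, "avg": mean}


def predict(level, s, n, d, strategy="max"):
    """Score of the pair (n, d) according to one hierarchy level."""
    values = [cohesiveness(c, s) for c in level if n in c[0] and d in c[1]]
    if not values:
        return 0.0  # assumption: no common cluster, no evidence
    return AGGREGATORS[strategy](values)


# Hypothetical level with two overlapping clusters that both contain (n1, d2).
scores = {("n1", "d1"): 0.9, ("n2", "d1"): 0.8, ("n2", "d2"): 0.7, ("n1", "d3"): 0.6}
level = [
    ({"n1", "n2"}, {"d1", "d2"}),  # cohesiveness 0.6
    ({"n1"}, {"d2", "d3"}),        # cohesiveness 0.3
]
for strategy in ("max", "min", "avg"):
    print(strategy, round(predict(level, scores, "n1", "d2", strategy), 2))
# max 0.6
# min 0.3
# avg 0.45
```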

3 Experiments

LP-HCLUS has been run with different values of its input parameters. In particular, following the results obtained in [12], we considered $\alpha \in \{0.1, 0.2\}$ and $\beta \in \{0.3, 0.4\}$. The considered datasets are: i) HMDD v3.0, which stores 985 miRNAs, 675 diseases and 20,859 relationships between diseases and miRNAs; ii) Integrated Dataset (ID), built by integrating multiple datasets [3,4,7,8], composed of 7,049 diseases, 70 lncRNA-miRNA relationships, 3,830 relationships between diseases and ncRNAs, 90,242 target genes, 26,522 disease-target associations and 1,055 ncRNA-target relationships.

We compared LP-HCLUS with the following competitors:
i) HOCCLUS2 [12], a biclustering algorithm that, similarly to LP-HCLUS, identifies a hierarchy of (possibly overlapping) heterogeneous clusters. It is, however, limited to work with only two types of objects. Since its parameters have a meaning similar to that of the LP-HCLUS parameters, we evaluated its results with the same setting, i.e., $\alpha \in \{0.1, 0.2\}$ and $\beta \in \{0.3, 0.4\}$.
ii) ncPred [1], a system that was specifically designed to predict new ncRNA-disease associations. ncPred cannot catch information coming from other entities in the network and it is not able to exploit features associated with nodes and links.
iii) LP-HCLUS-NoLP, which corresponds to a baseline version of the system LP-HCLUS, without the clustering and the link prediction steps. In particular, we consider the score obtained in the first phase of LP-HCLUS (see Section 2.1) as the final score associated with each interaction.

We adopted 10-fold cross-validation on the set of known ncRNA-disease relationships and, due to the absence of negative samples, we evaluated the results in terms of the TruePositiveRate@k (TPR@k) curve. Moreover, we also report the results in terms of ROC and Precision-Recall (PR) curves, obtained by considering the unknown relationships as negative examples. We remark that ROC and PR curves can only be used for relative comparison and not as absolute evaluation measures, because they are biased by the assumption made on unknown relationships.
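For a single cross-validation fold, a TPR@k value could be computed as in the sketch below. It assumes that TPR@k is the fraction of held-out known associations that appear among the k top-scored predictions; the function name and the toy scores are hypothetical.

```python
def tpr_at_k(ranked_pairs, held_out, k):
    """Fraction of held-out known associations found in the top-k predictions."""
    top = set(ranked_pairs[:k])
    return len(top & held_out) / len(held_out)


# Hypothetical predicted scores for candidate ncRNA-disease pairs.
scores = {
    ("n1", "d1"): 0.9,
    ("n2", "d1"): 0.7,
    ("n1", "d2"): 0.4,
    ("n2", "d2"): 0.2,
}
ranked = sorted(scores, key=scores.get, reverse=True)

# Known associations hidden from the method in this fold.
held_out = {("n1", "d1"), ("n2", "d2")}

print([tpr_at_k(ranked, held_out, k) for k in (1, 2, 4)])  # [0.5, 0.5, 1.0]
```

Plotting this value for increasing k gives the TPR@k curve, which needs no negative examples, unlike ROC and PR curves.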

In Figs. 4 and 5 we show some results obtained with the most promising configurations. From the quantitative viewpoint, we can observe that the proposed method LP-HCLUS, with the combination strategy based on the maximum, is able to obtain the best performance for all the considered measures. From a qualitative point of view, we first compared the results obtained by LP-HCLUS with the validated interactions reported in the updated version of HMDD (i.e., v3.2, released on March 27th, 2019). We found 3,055 LP-HCLUS predictions confirmed by the new release of HMDD at hierarchy level 1, 4,119 at level 2 and 4,797 at level 3. Next, we conducted a qualitative analysis of the top-ranked relationships predicted by LP-HCLUS using the ID dataset, selecting only those with a score equal to 1.0. For this purpose, we exploited MNDR v2.0, which is a comprehensive resource including more than 260,000 experimental and predicted ncRNA-disease associations for mammalian species. Also in this case, we found some associations that appear both in MNDR and in the list of associations predicted by LP-HCLUS. A more comprehensive analysis, reporting several additional examples, can be found in the full paper [2].

4 Conclusions

In this paper, we have tackled the problem of predicting possibly unknown ncRNA-disease relationships. The proposed approach LP-HCLUS is able to take advantage of the possibly heterogeneous nature of the attributed biological network analyzed. The results confirm the initial intuitions and show competitive performance of LP-HCLUS in terms of accuracy of the predictions, also when compared with state-of-the-art competitor systems. These results are also supported by a comparison of LP-HCLUS predictions with data reported in MNDR and by a qualitative analysis, which revealed that several ncRNA-disease associations predicted by LP-HCLUS have been subsequently experimentally validated and introduced in a more recent release (v3.2) of HMDD. As future work, we will evaluate the performance of LP-HCLUS in other domains.

5 Acknowledgments

We acknowledge the support of the Ministry of Education, Universities and Research (MIUR) through the PON project TALIsMAn - Tecnologie di Assistenza personALizzata per il Miglioramento della quAlità della vitA (ARS01_01116). Dr. Gianvito Pio acknowledges the support of the Ministry of Education, Universities and Research (MIUR) through the project AIM1852414, activity 1, line 1.

References

1. Alaimo, S., Giugno, R., Pulvirenti, A.: ncPred: ncRNA-Disease Association Prediction through Tripartite Network-Based Inference. Frontiers in Bioengineering and Biotechnology 2 (Dec 2014)

2. Barracchia, E.P., Pio, G., D'Elia, D., Ceci, M.: Prediction of new associations between ncRNAs and diseases exploiting multi-type hierarchical clustering. BMC Bioinformatics 21(1), 1-24 (2020)

3. Bauer-Mehren, A., Rautschka, M., Sanz, F., Furlong, L.I.: DisGeNET: a Cytoscape plugin to visualize, integrate, search and analyze gene-disease networks. Bioinformatics (Oxford, England) 26(22), 2924-2926 (Nov 2010)

4. Chen, G., Wang, Z., Wang, D., Qiu, C., Liu, M., Chen, X., Zhang, Q., Yan, G., Cui, Q.: LncRNADisease: a database for long-non-coding RNA-associated diseases. Nucleic Acids Research 41(Database issue) (Jan 2013)

5. Chen, X., Yan, C.C., Luo, C., Ji, W., Zhang, Y., Dai, Q.: Constructing lncRNA functional similarity network based on lncRNA-disease associations and disease semantic similarity. Scientific Reports 5 (Jun 2015)

6. Han, J., Kamber, M.: Data mining: concepts and techniques. Elsevier/Morgan Kaufmann, Amsterdam (2006)

7. Helwak, A., Kudla, G., et al.: Mapping the human miRNA interactome by CLASH reveals frequent noncanonical binding. Cell 153(3), 654-665 (2013)

8. Jiang, Q., Wang, Y., Hao, Y., Juan, L., Teng, M., Zhang, X., Li, M., Wang, G., Liu, Y.: miR2disease: a manually curated database for microRNA deregulation in human disease. Nucleic Acids Research 37(Database issue), D98-104 (Jan 2009)

9. Lesmo, L., Saitta, L., Torasso, P.: Evidence combination in expert systems. International Journal of Man-Machine Studies 22(3), 307-326 (Mar 1985)

10. Melissari, M.T., Grote, P.: Roles for long non-coding RNAs in physiology and disease. Pflügers Archiv - European Journal of Physiology 468(6), 945-958 (2016)

11. Mignone, P., Pio, G., D'Elia, D., Ceci, M.: Exploiting transfer learning for the reconstruction of the human gene regulatory network. Bioinform. 36(5), 1553-1561 (2020)

12. Pio, G., Ceci, M., D'Elia, D., Loglisci, C., Malerba, D.: A Novel Biclustering Algorithm for the Discovery of Meaningful Biological Correlations between microRNAs and their Target Genes. BMC Bioinformatics 14(Suppl 7), S8 (Apr 2013)

13. Pio, G., Ceci, M., Prisciandaro, F., Malerba, D.: Exploiting causality in gene network reconstruction based on graph embedding. Machine Learning (2019)

14. Pio, G., Serafino, F., Malerba, D., Ceci, M.: Multi-type clustering and classification from heterogeneous networks. Information Sciences 425, 107-126 (Jan 2018)

15. Wang, P., Guo, Q., et al.: Improved method for prioritization of disease associated lncRNAs based on ceRNA theory and functional genomics data. Oncotarget 8(3), 4642-4655 (Dec 2016)