=Paper=
{{Paper
|id=Vol-2646/37-paper
|storemode=property
|title=Prediction of New Associations between ncRNAs and Diseases Exploiting Multi-Type Hierarchical Clustering
|pdfUrl=https://ceur-ws.org/Vol-2646/37-paper.pdf
|volume=Vol-2646
|authors=Emanuele Pio Barracchia,Gianvito Pio,Domenica D'Elia,Michelangelo Ceci
|dblpUrl=https://dblp.org/rec/conf/sebd/BarracchiaPDC20
}}
==Prediction of New Associations between ncRNAs and Diseases Exploiting Multi-Type Hierarchical Clustering==
<pdf width="1500px">https://ceur-ws.org/Vol-2646/37-paper.pdf</pdf>
<pre>
      Prediction of New Associations between
    ncRNAs and Diseases Exploiting Multi-Type
     Hierarchical Clustering (Discussion Paper)

       Emanuele Pio Barracchia1,3 , Gianvito Pio1,3 , Domenica D’Elia2 , and
                            Michelangelo Ceci1,3,4
       1
         Dept. of Computer Science - University of Bari Aldo Moro, Bari (Italy)
       {emanuele.barracchia, gianvito.pio, michelangelo.ceci}@uniba.it
              2
                  CNR, Institute for Biomedical Technologies - Bari (Italy)
                            domenica.delia@ba.itb.cnr.it
                3
                   Big Data Laboratory, CINI Consortium - Rome (Italy)
    4
      Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana (Slovenia)


           Abstract. The study of functional associations between ncRNAs and
           human diseases is a pivotal task of modern research to develop new
           and more effective therapeutic approaches. Nevertheless, it is not a triv-
           ial task since it involves entities of different types, such as microRNAs,
           lncRNAs or target genes. Such a complexity can be faced by representing
           the involved biological entities and their relationships as a network and
           by exploiting network-based computational approaches able to identify
           new associations. However, existing methods are limited to homogeneous
           networks or can exploit only a limited set of the features of biological enti-
           ties. To overcome the limitations of existing approaches, we proposed the
           system LP-HCLUS, which analyzes heterogeneous networks consisting of
           several types of objects and relationships, each possibly described by a
           set of features, and extracts hierarchically organized, possibly overlap-
           ping, multi-type clusters that are subsequently exploited to predict new
           ncRNA-disease associations. Our experimental evaluation shows that,
           according to both quantitative (i.e., TPR@k, ROC and PR curves) and
           qualitative criteria, LP-HCLUS produces better results.

           Keywords: non-coding RNA (ncRNAs) · diseases · cancer · heteroge-
           neous network · clustering · link prediction

1     Introduction
High-throughput sequencing technologies and recent, more efficient computa-
tional approaches, have been fundamental for the rapid advances in functional
genomics. Among the most relevant results, there is the discovery of thousands
of non-coding RNAs (ncRNAs) with a regulatory function on gene expression.
    In parallel, the number of studies reporting the involvement of ncRNAs in
the development of many different human diseases has grown exponentially. The
    Copyright c 2020 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0). This volume is published
    and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy.
first type of ncRNAs that has been discovered and largely studied is that of mi-
croRNAs (miRNAs), classified as small non-coding RNAs in contrast with long
non-coding RNAs (lncRNAs), that are ncRNAs longer than 200nt. While miR-
NAs primarily act as post-transcriptional regulators, lncRNAs have a plethora of
regulatory functions [10]. However, the number of lncRNAs for which the func-
tional and molecular mechanisms are completely elucidated is still quite poor
and experimental investigations are still too much expensive for being carried
out without any computational pre-analysis. In the last few years, there have
been several attempts to computationally predict the relationships among bi-
ological entities, such as genes, miRNAs, lncRNAs, diseases [1,11,13,15]. Such
methods are based on a network representation of the entities under study and
on the identification of new links among nodes in the network. However, most
of them are able to work only on homogeneous networks (where nodes and links
are of one single type) [5], are strongly limited by the number of different node
types or are constrained by pre-defined network structures.
    In this discussion paper, we describe the method LP-HCLUS [2], that is able
to overcome these limitations. In particular, it can discover new ncRNA-disease
relationships from heterogeneous attributed networks (i.e., consisting of differ-
ent biological entities related by different types of relationships) with arbitrary
structure. This ability allows LP-HCLUS to investigate the interactions among
different types of entities, possibly leading to increased prediction accuracy.
    LP-HCLUS exploits a combined approach based on hierarchical, multi-type
clustering and link prediction. As we will detail in the next section, a multi-type
cluster is actually a heterogeneous sub-network. Therefore, the adoption of a
clustering-based approach allows LP-HCLUS to base its predictions on relevant,
highly-cohesive heterogeneous sub-networks. Moreover, the hierarchical organi-
zation of clusters allows it to perform predictions at different levels of granularity,
taking into account either local/specific or global/general relationships.

2    Method
In the following, we introduce the notation and some useful definitions.
Definition 1 (Heterogeneous attributed network). A heterogeneous at-
tributed network is a network G = (V, E), where V is the set of nodes and E is
the set of edges, and both nodes and edges can be of different types. Moreover:

 – T = Tt ∪ Ttr is the set of node types, where Tt is the set of target types, i.e.
   considered as target of the clustering/prediction task, and Ttr is the set of
   task-relevant types. Only nodes of target types are clustered and considered
   in the identification of new relationships.
 – Each node type Tv ∈ T defines a subset of nodes in the network, i.e., Vv ⊆ V .
 – Each node type Tv ∈ T is associated with a set of attributes Av = {Av,1 , Av,2 ,
   ..., Av,mv }, i.e., nodes of the type Tv are described by the attributes Av .
 – R is the set of all the possible edge types.
 – Each edge type Rl ∈ R defines a subset of edges El ⊆ E.
                                                   C1

                                                          C1, C2

                                                   C2
                                                                   C1, C2, C3


                                                   C3
                                                                                C1, C2, C3,
                                                                                  C4, C5

                                                   C4

                                                          C4, C5

                                                   C5


                          (a)                                            (b)

Fig. 1. A hierarchy of overlapping multi-type clusters: (a) emphasizes the overlapping
among multi-type clusters; (b) shows their hierarchical organization.


Definition 2 (Hierarchical multi-type clustering). A hierarchy of multi-
type clusters is defined as a list of hierarchy levels [L1 , L2 , . . . , Lk ], where each
Li consists of a set of overlapping multi-type clusters. For each level Li , i =
2, 3, .. . . . k, ∀ G0 ∈ Li ∃ G00 ∈ Li−1 , such that G00 is a subnetwork of G0 (Fig. 1).

According to these definitions, we define the task considered in this work.

Definition 3 (Predictive hierarchical clustering for link prediction).
Given a heterogeneous attributed network G = (V, E) and the set of target types
Tt , the goal is to find:
 – A hierarchy of overlapping multi-type clusters [L1 , L2 , . . . , Lk ].
 – A function ψ (w) : Vi1 ×Vi2 →[0, 1] for each hierarchical level Lw (w ∈ 1, 2, ..., k),
   where nodes in Vi1 are of type Ti1 ∈ Tt and nodes in Vi2 are of type Ti2 ∈ Tt .
   Each function ψ (w) maps each possible pair of nodes (of types Ti1 and Ti2 )
   to a score representing the degree of certainty of their relationship.

In this paper LP-HCLUS has been used to solve the task formalized in Definition
3, by considering ncRNAs and diseases as target types. Hence, we determine two
distinct set of nodes denoted by Tn and Td , representing the set of ncRNAs and
the set of diseases, respectively. In the following subsections, we will describe the
main steps executed by LP-HCLUS (see Fig. 2 for a general overview).
2.1 Estimation of the strength of the relationship
In the first phase, we estimate the strength of the relationship among all the
possible ncRNA-disease pairs in the network G. In particular, we aim to com-
pute a score s(ni , dj ) for each possible pair ni , dj , by exploiting the concept of
meta-path. According to [14], a meta-path is a set of sequences of nodes (in-
volving both target and task-relevant types) which follow the same sequence of
edge types, and can be used to fruitfully represent conceptual (possibly indi-
rect) relationships between two entities in a heterogeneous network. Given the
ncRNA ni and the disease dj , the relationship between them can be consid-
ered “certain” if there is at least one meta-path which confirms its certainty.
                                                             Extracted edges

                                                                   w1

                                       Estimation of the
                                                                               Construction of the
                                          strength of              w2
                                                                               hierarchy of clusters
                                        relationships
                                                                   ...
                                                                  w|Ê|


                                   Predicted relationships                     Hierarchy of clusters

                                          s(n1, d1)
                                     d1               n1

                                          s(n1, d3)              Prediction
                                     d3               n1
                                             ...
                                          s(nk, di)
                                     di               nk


                   Fig. 2. General workflow of the method LP-HCLUS.


Therefore, by assimilating the score associated with an interaction to its degree
of certainty, we compute s(ni , dj ) as the maximum value observed over all the
possible meta-paths between ni and dj . Formally:

                   s(ni , dj ) =                   max         pathscore(P, ni , dj )                  (1)
                                   P ∈metapaths(ni ,dj )


where metapaths(ni , dj ) is the set of meta-paths connecting ni and dj , and
pathscore(P, ni , dj ) is the degree of certainty of the relationship between ni and
dj according to the meta-path P . In order to compute pathscore(P, ni , dj ), we
represent each meta-path P as a finite set of sequences of nodes. If a sequence
in P connects ni and dj , then pathscore(P, ni , dj ) = 1. Otherwise, following
the same strategy introduced before, it is computed as the maximum similarity
between the sequences which start with ni and the sequences which end with dj
(see Fig. 3). The intuition behind this formula is that if ni and dj are not directly
connected, their score represents the similarity of the nodes and edges they are
connected to. The similarity between two sequences seq 0 and seq 00 is computed
according to the the attributes of all nodes involved in the two sequences: fol-
                                                                        0
                                                                          )−valx (seq 00 )|
lowing [6], if x is numeric, then sx (seq 0 , seq 00 ) = 1 − |valx (seq
                                                                    maxx −minx              , where
minx (resp. maxx ) is the minimum (resp. maximum) value, for the attribute x;
if x is not a numeric attribute, sx (seq 0 , seq 00 ) = 1 if valx (seq 0 ) = valx (seq 00 ), 0
otherwise. In this solution there could be some node types that are not involved
in any meta-path. In order to exploit the information conveyed by these nodes,
we add an aggregation of their attribute values (the arithmetic mean for nu-
merical attributes, the mode for non-numerical attributes) to the nodes that are
connected to them and that appear in at least one meta-path.

2.2 Construction of a hierarchy of overlapping multi-type clusters
We construct the first level of the hierarchy by identifying a set of overlapping
multi-type clusters in the form of bicliques. To this aim, we perform three steps:
i) Filtering, which keeps only the ncRNA-disease pairs with a score greater
than (or equal to) β. The result of this step is the subset {(ni , dj )|s(ni , dj ) ≥ β}
                                     Attributes of other entities
                  ncRNA Attributes                                  Disease Attributes
                                             in the path
         Seq n.

           1
           2
           3
           4
           5
           6
           7
           8
           9


Fig. 3. Sequences between the ncRNA “h19” and the disease “asthma” according to
a meta-path. Sequences emphasized in yellow (1 and 9) are those starting with “h19”,
while sequences emphasized in blue (4, 5, 6 and 7) are those ending with “asthma”.


ii) Initialization, which builds the initial set of clusters in the form of bicliques,
each consisting of a ncRNA-disease pair in {(ni , dj )|s(ni , dj ) ≥ β}.
iii) Merging, which iteratively merges two clusters C 0 and C 00 into a new cluster
C 000 . This step regards the initial set of clusters as a list sorted according to an
ordering relation <c that reflects the quality of the clusters. Each cluster C 0 is
then merged with the first cluster C 00 in the list that would lead to a cluster
C 000 which still satisfies the biclique constraint. This step is repeated until no
additional clusters that satisfy the biclique constraint can be obtained.
     The ordering relation <c defines a greedy search strategy that guides the
order in which pairs of clusters are analyzed. <c is based on the cluster co-
hesiveness h(c), thatP corresponds to the average score in the cluster, namely:
               1
h(C) = |pairs(C)|   · (ni ,dj )∈pairs(C) s(ni , dj ), where pairs(C) is the set of all the
possible ncRNA-disease pairs that can be constructed from the set of ncRNAs
and diseases in the cluster. Accordingly, if C 0 and C 00 are two different clusters,
the ordering relation <c is defined as follows: C 0 <c C 00 ⇐⇒ h(C 0 ) > h(C 00 ).
     The approach adopted to build the other hierarchical levels is similar to the
merging step performed to obtain L1 . The main difference is that we do not
obtain bicliques, but generic multi-type clusters. Since the biclique constraint is
removed, we need another stopping criterion for the iterative merging procedure.
Coherently with approaches used in hierarchical co-clustering and following [12],
we adopt a user-defined threshold α on the cohesiveness of the obtained clusters.
In particular, two clusters C 0 and C 00 can be merged into a new cluster C 000 if
h(C 000 ) > α, where h(C 000 ) is the cluster cohesiveness. This means that α defines
the minimum cluster cohesiveness that must be satisfied by a cluster obtained
after a merging. The iterative process stops when it is not possible to merge
more clusters with a minimum level of cohesiveness α.
2.3 Prediction of new ncRNA-disease relationships
In the last phase, we exploit each level of the identified hierarchy of multi-type
clusters as a prediction model. In particular, we compute, for each ncRNA-
disease pair, a score representing its degree of certainty on the basis of the
                                                     w
multi-type clusters containing it. Formally, let Cij    be a cluster identified in
the w-th hierarchical level in which the ncRNA ni and the disease dj appear.
We compute the degree of certainty of the relationship between ni and dj as:
Fig. 4. TPR@k, Precision-Recall and ROC curves results for the dataset HMDD v3.0,
obtained with the best configuration (α = 0.2, β = 0.4, level = 2).

                          
ψ (w) (ni , dj ) = h Cijw
                            , that is, we compute the degree of certainty of the new
interaction as the average degree of certainty of the known relationships in the
cluster. In some cases, the same interaction may appear in multiple clusters,
since the proposed algorithm is able to identify overlapping clusters. In this
         w
case, Cij    represents the list of multi-type clusters in which both ni and dj appear
and we aggregate their cohesiveness values according to four different strategies:
maximum, minimum, average and evidence combination [9].

3    Experiments
LP-HCLUS has been run with different values of its input parameters. In par-
ticular, following the results obtained in [12], we considered α ∈ {0.1, 0.2} and
β ∈ {0.3, 0.4}. The considered datasets are: i) HMDD v3.0 which stores 985
miRNAs, 675 diseases and 20,859 relationships between diseases and miRNAs;
ii) Integrated Dataset (ID), built by integrating multiple datasets [3,4,7,8],
composed by 7,049 diseases, 70 lncRNA-miRNA relationships, 3,830 relation-
ships between diseases and ncRNAs, 90,242 target genes, 26,522 disease-target
associations and 1,055 ncRNA-target relationships.
     We compared LP-HCLUS with the following competitors:
i) HOCCLUS2 [12], a biclustering algorithm that, similarly to LP-HCLUS,
identifies a hierarchy of (possibly overlapping) heterogeneous clusters. It is, how-
ever, limited to work with only two types of objects. Since its parameters have a
similar meaning with respect to LP-HCLUS parameters, we evaluated its results
with the same setting, i.e., α ∈ {0.1, 0.2} and β ∈ {0.3, 0.4};
ii) ncPred [1], a system that was specifically designed to predict new ncRNA-
disease associations. ncPred cannot catch information coming from other entities
in the network and it is not able to exploit features associated to nodes and links.
iii) LP-HCLUS-NoLP, which corresponds to a baseline version of system LP-
HCLUS, without the clustering and the link prediction steps. In particular, we
consider the score obtained in the first phase of LP-HCLUS (see Section 2.1) as
the final score associated with each interaction.
     We adopted the 10-fold cross validation on the set of known ncRNA-disease
relationships and, due to absence of negative samples, we evaluated the results in
Fig. 5. TPR@k, Precision-Recall and ROC curves results for the dataset ID, obtained
with the best configuration (α = 0.1, β = 0.4, level = 1).


terms of TruePositiveRate@k curve. Moreover, we also report the results in terms
of ROC and Precision-Recall curves by considering the unknown relationships
as negative examples. We remark that ROC and PR curves can only be used for
relative comparison and not as absolute evaluation measures because they are
spoiled by the assumption made on unknown relationships.
    In Figs. 4 and 5 we show some results obtained with the most promising con-
figurations. From the quantitative viewpoint, we can observe that the proposed
method LP-HCLUS, with the combination strategy based on the maximum, is
able to obtain the best performances, for all the considered measures. From a
qualitative point of view, we first performed a comparative analysis between the
results obtained by LP-HCLUS against the validated interactions reported in
the updated version of HMDD (i.e., v3.2 released on March 27th, 2019). We
found 3,055 LP-HCLUS predictions confirmed by the new release of HMDD at
the hierarchy level 1, 4,119 at level 2 and 4,797 at level 3. Next, we conducted a
qualitative analysis of the top-ranked relationships predicted by LP-HCLUS us-
ing ID dataset, selecting only those with a score equal to 1.0. For this purpose, we
exploited MNDR v2.0, which is a comprehensive resource including more than
260,000 experimental and predicted ncRNA-disease associations for mammalian
species. Also in this case, we found some associations in both MNDR and in the
list of predicted associations by LP-HCLUS. A more comprehensive analysis,
reporting several additional examples, can be found in the full paper [2].

4   Conclusions
In this paper, we have tackled the problem of predicting possibly unknown
ncRNA-disease relationships. The proposed approach LP-HCLUS is able to take
advantage from the possible heterogeneous nature of the attributed biological
network analyzed. The results confirm the initial intuitions and show compet-
itive performances of LP-HCLUS in terms of accuracy of the predictions, also
when compared with state-of-the-art competitor systems. These results are also
supported by a comparison of LP-HCLUS predictions with data reported in
MNDR and by a qualitative analysis that revealed that several ncRNA-disease
associations predicted by LP-HCLUS have been subsequently experimentally
validated and introduced in a more recent release (v3.2) of HMDD. As future
work, we will evaluate the performance of LP-HCLUS in other domains.
5    Acknowledgments
We acknowledge the support of Ministry of Education, Universities and Research
(MIUR) through the PON project TALIsMAn - Tecnologie di Assistenza per-
sonALizzata per il Miglioramento della quAlità della vitA (ARS01 01116). Dr.
Gianvito Pio acknowledges the support of Ministry of Education, Universities
and Research (MIUR) through the project AIM1852414, activity 1, line 1.
References
 1. Alaimo, S., Giugno, R., Pulvirenti, A.: ncPred: ncRNA-Disease Association Pre-
    diction through Tripartite Network-Based Inference. Frontiers in Bioengineering
    and Biotechnology 2 (Dec 2014)
 2. Barracchia, E.P., Pio, G., D’Elia, D., Ceci, M.: Prediction of new associations
    between ncrnas and diseases exploiting multi-type hierarchical clustering. BMC
    bioinformatics 21(1), 1–24 (2020)
 3. Bauer-Mehren, A., Rautschka, M., Sanz, F., Furlong, L.I.: DisGeNET: a Cytoscape
    plugin to visualize, integrate, search and analyze gene-disease networks. Bioinfor-
    matics (Oxford, England) 26(22), 2924–2926 (Nov 2010)
 4. Chen, G., Wang, Z., Wang, D., Qiu, C., Liu, M., Chen, X., Zhang, Q., Yan, G.,
    Cui, Q.: LncRNADisease: a database for long-non-coding RNA-associated diseases.
    Nucleic Acids Research 41(Database issue) (Jan 2013)
 5. Chen, X., Yan, C.C., Luo, C., Ji, W., Zhang, Y., Dai, Q.: Constructing lncRNA
    functional similarity network based on lncRNA-disease associations and disease
    semantic similarity. Scientific Reports 5 (Jun 2015)
 6. Han, J., Kamber, M.: Data mining: concepts and techniques. Elsevier/Morgan
    Kaufmann, Amsterdam (2006)
 7. Helwak, A., Kudla, G., et al.: Mapping the human miRNA interactome by CLASH
    reveals frequent noncanonical binding. Cell 153(3), 654–665 (2013)
 8. Jiang, Q., Wang, Y., Hao, Y., Juan, L., Teng, M., Zhang, X., Li, M., Wang, G.,
    Liu, Y.: miR2disease: a manually curated database for microRNA deregulation in
    human disease. Nucleic Acids Research 37(Database issue), D98–104 (Jan 2009)
 9. Lesmo, L., Saitta, L., Torasso, P.: Evidence combination in expert systems. Inter-
    national Journal of Man-Machine Studies 22(3), 307–326 (Mar 1985)
10. Melissari, M.T., Grote, P.: Roles for long non-coding RNAs in physiology and
    disease. Pflügers Archiv - European Journal of Physiology 468(6), 945–958 (2016)
11. Mignone, P., Pio, G., D’Elia, D., Ceci, M.: Exploiting transfer learning for the
    reconstruction of the human gene regulatory network. Bioinform. 36(5), 1553–1561
    (2020)
12. Pio, G., Ceci, M., D’Elia, D., Loglisci, C., Malerba, D.: A Novel Biclustering Algo-
    rithm for the Discovery of Meaningful Biological Correlations between microRNAs
    and their Target Genes. BMC Bioinformatics 14(Suppl 7), S8 (Apr 2013)
13. Pio, G., Ceci, M., Prisciandaro, F., Malerba, D.: Exploiting causality in gene net-
    work reconstruction based on graph embedding. Machine Learning (2019)
14. Pio, G., Serafino, F., Malerba, D., Ceci, M.: Multi-type clustering and classification
    from heterogeneous networks. Information Sciences 425, 107–126 (Jan 2018)
15. Wang, P., Guo, Q., et al.: Improved method for prioritization of disease associated
    lncRNAs based on ceRNA theory and functional genomics data. Oncotarget 8(3),
    4642–4655 (Dec 2016)

</pre>