<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Positive Unlabeled Link Prediction via Transfer Learning for Gene Network Reconstruction (Discussion Paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paolo Mignone</string-name>
          <email>paolo.mignone@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianvito Pio</string-name>
          <email>gianvito.pio@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michelangelo Ceci</string-name>
          <email>michelangelo.ceci@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science - University of Bari Aldo Moro</institution>
          ,
          <addr-line>Via Orabona 4, 70125 Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Transfer learning can be employed to leverage knowledge from a source domain in order to better solve tasks in a target domain, where the available data are scarce. While most previous works consider the supervised setting, we study the more challenging case of positive-unlabeled transfer learning, where only few positive labeled instances are available for both the source and the target domains. Specifically, we focus on the link prediction task on network data, where we consider known existing links as positive labeled data and all the possible remaining links as unlabeled data. The transfer learning method described in this paper exploits the unlabeled data and the knowledge of a source network in order to improve the reconstruction of a target network. Experiments, conducted in the biological field, showed the effectiveness of the proposed approach with respect to the considered baselines, when exploiting the Mus Musculus gene network (source) to improve the reconstruction of the Homo Sapiens Sapiens gene network (target).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The link prediction task aims at estimating the probability of the existence of
an interaction between two entities, on the basis of the available set of known
interactions, which are assumed to belong to the same data distribution and the same
context, and to be described by the same features. However, in many real cases, the
identical-distribution assumption does not hold, especially if data are organized
in heterogeneous information networks [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. For example, in the study of
biological networks, collecting training data is very expensive, and it is often necessary
to build link prediction models on the basis of data regarding different (even if
related) contexts. In this regard, transfer learning strategies can be adopted to
leverage knowledge from a source domain to improve the performance on a task
solved in a target domain, for which few labeled data are available (see Figure 1).
In the literature, we can find several applications where transfer learning
approaches have proved to be beneficial. For example, in the classification of
Web documents, transfer learning approaches can be exploited to classify newly
created Web sites which follow a different data distribution [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (e.g., the content
is related to new subtopics). Another example is the work proposed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where
the authors exploit transfer learning approaches to adapt a WiFi localization
model trained in one time period (source domain) to a new time period (target
domain), where the available data possibly follow a different data distribution.
(Copyright © 2019 for the individual papers by the papers' authors. Copying
permitted for private and academic purposes. This volume is published and copyrighted
by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.)
      </p>
      <p>
        Focusing on the link prediction task, in the literature we can find several
works in the biological field, since biological entities and their relationships can
be naturally represented as a network. In the specific field of genomics, the
working mechanisms of several organisms are usually modeled through gene
interaction networks, where nodes represent genes and edges represent regulation
activities. Since gene expression data are easy to obtain, several methods have
been proposed in the literature that exploit this kind of data [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. These
approaches analyze the expression level of the genes under different conditions.
Therefore, most machine learning approaches aiming to solve the link prediction
task generally analyze gene expression data, with the final goal of reconstructing
the whole network structure (Gene Network Reconstruction - GNR).
      </p>
      <p>
        While existing methods generally work effectively on sufficiently large
training data, a transfer learning approach could favor the GNR of specific organisms
which are not well studied, by exploiting the knowledge acquired about different,
related organisms. The main contribution of our work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] is to evaluate the
possible benefits that transfer learning techniques can provide to the task of link
prediction. In particular, we exploit the available information about a source
network for the reconstruction of a target network with poor available data.
Moreover, we study the more challenging setting, i.e., the Positive-Unlabeled
(PU) setting, where few positive labeled examples are available for both
domains, and no negative example is available [
        <xref ref-type="bibr" rid="ref11 ref2">2, 11</xref>
        ]. The PU learning setting holds
in many real contexts (e.g., text categorization, bioinformatics), where it is very
expensive or unfeasible to obtain negative examples for the studied concept.
      </p>
      <p>In order to evaluate the performance of the proposed method, we performed
experiments in the biological domain. In particular, our experiments focused on
the reconstruction of the human (Homo Sapiens Sapiens) gene network, guided by
the gene network of another, related organism, i.e., the mouse (Mus Musculus).</p>
    </sec>
    <sec id="sec-2">
      <title>The proposed method</title>
      <p>In this section, we describe our transfer learning approach to solve link prediction
tasks in network data. Before describing it in detail, we introduce some useful
notions and formally define the link prediction task for a single domain. Let:
- V be the set of nodes of the network;
- x = ⟨v′, v″⟩ ∈ (V × V) be a (possible) link between v′ and v″, where v′ ≠ v″;
- e(v) = [e₁(v), e₂(v), …, eₙ(v)] be the vector of features related to the node v,
where eᵢ(v) ∈ ℝ, ∀i ∈ {1, 2, …, n};
- e(x) = [e₁(v′), e₂(v′), …, eₙ(v′), e₁(v″), e₂(v″), …, eₙ(v″)] be the vector of
features related to the link x = ⟨v′, v″⟩;
- sim(a, b) ∈ [0, 1] be a similarity function between the vectors a and b;
- l(x) be a function that returns 1 if the link x is a known existing link, and 0
if its existence is unknown;
- L = {x | x ∈ (V × V) ∧ l(x) = 1} be the set of labeled links;
- U = (V × V) \ L be the set of unlabeled links;
- D = {X̃, P(X)} be the domain described by the feature space X̃ = ℝ²ⁿ, with
a specific marginal data distribution P(X), where X = L ∪ U;
- w(x) (0 ≤ w(x) ≤ 1) be a computed weight for the link x ∈ U, which estimates
the degree of certainty of its existence;
- f(x) be an ideal function which returns 1 if x exists, and 0 otherwise.
The task we intend to solve is then defined as follows:
Given: a set of training examples {⟨e(x), w(x)⟩}ₓ, each of which is described by a
feature vector and a weight.</p>
      <p>Find: a function f′ : ℝ²ⁿ → [0, 1] which takes as input a vector of features e(x)
and returns the probability that x exists. Thus, f′(e(x)) ≈ P(f(x) = 1) or, in
other terms, f′ approximates the probability distribution over the values of f.</p>
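As a minimal, hypothetical illustration of this formalization (toy node features, not the paper's data), the feature vector e(x) of a link is the concatenation of the two node vectors, and the unlabeled set U contains every possible non-self link that is not in L:

```python
# Toy illustration of the single-domain notation above (hypothetical data).

def link_features(e, v1, v2):
    """e maps each node to its n-dimensional feature vector; e(x) is the
    concatenation of the vectors of the two endpoints (2n values)."""
    assert v1 != v2, "self-links are excluded (v' != v'')"
    return e[v1] + e[v2]

# Three nodes with n = 2 features each.
e = {"g1": [0.1, 0.9], "g2": [0.4, 0.2], "g3": [0.8, 0.5]}
L = {("g1", "g2")}                                   # known positive links
V = set(e)
U = {(a, b) for a in V for b in V if a != b} - L     # unlabeled links

print(link_features(e, "g1", "g2"))  # [0.1, 0.9, 0.4, 0.2]
print(len(U))                        # 5 of the 6 ordered non-self pairs
```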
      <p>Our method works with two different domains: the source domain Dₛ =
{X̃ₛ, P(Xₛ)} and the target domain Dₜ = {X̃ₜ, P(Xₜ)}, which are described
according to the same feature space, i.e., X̃ₛ = X̃ₜ, while the marginal data
distributions are generally different, i.e., P(Xₛ) ≠ P(Xₜ).</p>
      <p>Given the two sets of labeled examples Lₛ and Lₜ, regarding the source and
the target domain respectively, the method consists of three stages, which are
summarized in Figure 2 and detailed in the following subsections.
Stage I - Clustering. The first stage of our method consists in the identification
of a clustering model for the positive examples of each domain (i.e., on Lₛ
and Lₜ). The application of a clustering method is motivated by the necessity
to distinguish among possible multiple viewpoints of the underlying concept of
positive interactions. Moreover, a summarization in terms of clusters' centroids
is useful also from a computational viewpoint, since in the subsequent
stages we can compare centroids instead of single instances. We adopt the
classical k-means algorithm, since it is well established in the literature. However,
any other prototype-based clustering algorithm, possibly able to catch specific
peculiarities of the data at hand, could be plugged into our method.
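Stage I can be sketched with a plain k-means over the positive-link feature vectors. This is a toy sketch (deterministic initialization, made-up vectors), not the implementation used in the paper:

```python
# Minimal k-means over positive-link feature vectors (Stage I sketch).
import math

def kmeans(points, k, iters=20):
    """Plain k-means with a deterministic init (the first k points)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        # Assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Recompute centroids (keep the old one for an empty cluster).
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids

# Toy positive-link vectors (2n = 4 dimensions), forming two obvious groups.
Ls = [[0.0, 0.0, 0.1, 0.1], [0.1, 0.0, 0.0, 0.1],
      [0.9, 1.0, 0.9, 0.9], [1.0, 0.9, 1.0, 1.0]]
centroids = kmeans(Ls, k=2)
print(sorted(c[0] for c in centroids))  # one centroid near 0, one near 1
```

Any prototype-based algorithm with the same output shape (a list of centroids per domain) could be substituted here, as the text notes.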
Stage II - Instance Weighting. Although an unlabeled link could be either a
positive or a negative example, we consider all the unlabeled examples as
positive examples and compute a weight in [0, 1] representing the degree of certainty
of being a positive example: a value close to 0 (respectively, to 1) means that the
example is likely to be a negative (respectively, positive) example. The weights
associated with the unlabeled instances of both the source and the target domains
are computed according to their similarities with respect to the centroids
obtained in the first stage. In particular, we adopt a different weighting function
for the source and the target domains, in order to smooth the contribution provided
by instances coming from the source domain.</p>
      <p>Specifically, an unlabeled link x belonging to the target network (i.e., x ∈
Vₜ × Vₜ) is weighted according to its similarity with respect to the centroid of its
closest cluster, among the clusters identified from the target network. Formally:
w(x) = max_{cₜ ∈ Cₜ} sim(e(x), cₜ),    (1)
where Cₜ are the clusters identified from positive examples of the target network.</p>
      <p>On the other hand, an unlabeled link x′ belonging to the source network (i.e.,
x′ ∈ Vₛ × Vₛ) is weighted by considering two similarity values: i) the similarity
with respect to the centroid of its closest cluster, computed among the clusters
identified from the source network, and ii) the similarity between such a centroid
and the closest centroid identified on the target network. Formally, let c′ =
argmax_{cₛ ∈ Cₛ} sim(e(x′), cₛ) be the closest centroid with respect to x′ among
the possible centroids Cₛ identified in the source network. Then:
w(x′) = sim(e(x′), c′) · max_{c″ ∈ Cₜ} sim(c′, c″).    (2)
As a similarity function, we exploit the Euclidean distance, after applying a
min-max normalization (in the range [0, 1]) to all the features of the feature vectors.
Formally, sim(e(x′), e(x″)) = 1 − √(Σₖ₌₁²ⁿ (eₖ(x′) − eₖ(x″))²).</p>
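Equations (1) and (2) and the similarity function can be sketched as follows, assuming feature vectors are already min-max normalized to [0, 1]; the toy centroids and the names `weight_target`/`weight_source` are ours, not the paper's:

```python
# Sketch of the Stage II weighting (Equations 1 and 2), assuming min-max
# normalized feature vectors so that sim(a, b) = 1 - Euclidean distance.
import math

def sim(a, b):
    return 1 - math.dist(a, b)

def weight_target(ex, Ct):
    # Eq. (1): similarity to the closest target-network centroid.
    return max(sim(ex, ct) for ct in Ct)

def weight_source(ex, Cs, Ct):
    # Eq. (2): similarity to the closest source centroid, smoothed by how
    # close that centroid is to the target clustering.
    c0 = max(Cs, key=lambda cs: sim(ex, cs))
    return sim(ex, c0) * max(sim(c0, c00) for c00 in Ct)

Ct = [[0.2, 0.2], [0.8, 0.8]]          # toy target centroids (2n = 2 here)
Cs = [[0.25, 0.25]]                    # toy source centroid
print(weight_target([0.2, 0.2], Ct))   # 1.0: coincides with a centroid
print(weight_source([0.25, 0.25], Cs, Ct))
```

Note how the extra factor in `weight_source` dampens source instances whose closest source centroid lies far from every target centroid.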
      <p>
        An overview of the weighting strategy can be graphically observed in Figure 3.
Stage III - Training the classifier. In the third stage, we train a
probabilistic classifier, based on linear Weighted Support Vector Machines (WSVM) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
with Platt scaling [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], from the weighted unlabeled instances coming from both
the source and the target networks. We selected an SVM-based classifier mainly
because i) it has a (relatively) good computational efficiency, especially in the
prediction phase, and ii) it has already proved to be effective (with Platt scaling)
in the semi-supervised setting [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. At the end of the training phase, the WSVM
classifier produces a model in the form of a hyperplane function h. Then, by
exploiting the Platt scaling, for each unlabeled link x, we compute the
probability of being a positive example as f′(e(x)) = 1 / (1 + e^(−h(e(x)))), where h(e(x)) is the
score obtained by the learned WSVM. Finally, we rank all the predicted links
in descending order with respect to their probability of being positive. A
pseudo-code representation of the proposed method is shown in Algorithm 1.
      </p>
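The probability computation and ranking of Stage III can be sketched as below; the WSVM training itself is omitted, and `scores` stands in for the learned hyperplane function h applied to each unlabeled link (hypothetical values):

```python
# Sketch of Stage III output: logistic (Platt-style) transform of the
# hyperplane scores, followed by a descending ranking of the links.
import math

def platt_probability(h_score):
    # f'(e(x)) = 1 / (1 + e^(-h(e(x))))
    return 1.0 / (1.0 + math.exp(-h_score))

# Hypothetical scores h(e(x)) for three unlabeled links.
scores = {("g1", "g3"): 2.1, ("g2", "g3"): -0.4, ("g3", "g1"): 0.0}
ranking = sorted(scores, key=lambda x: platt_probability(scores[x]),
                 reverse=True)
print(ranking[0])  # the link with the highest probability of existing
```

Since the logistic transform is monotone, ranking by probability and ranking by raw score coincide; the probabilities matter when thresholds or calibrated outputs are needed.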
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>Our experiments have been performed in the biological field. In particular, the
specific task we intend to solve is the reconstruction of the human (Homo Sapiens
Sapiens - HSS) gene network. As the source, we exploit the gene network
of another, related organism, i.e., the mouse (Mus Musculus - MM).</p>
      <sec id="sec-3-1">
        <title>3.1 Dataset</title>
        <p>The considered dataset consists of gene expression data related to 6 organs (lung,
liver, skin, brain, bone marrow, heart), obtained from the samples available at
Gene Expression Omnibus (GEO), a public functional genomics repository.
Overall, 161 and 174 samples were considered for MM and HSS, respectively,
involving a total of 45,101 and 54,675 genes, respectively, each of which is described
by 6 features (one for each organ). Accordingly, the interactions among genes
were described by 12-dimensional feature vectors, obtained by concatenating the
feature vectors associated with the involved genes.</p>
        <p>
          The set of validated gene interactions (verified positive examples) was
extracted from the public repository BioGRID (https://thebiogrid.org).
Since our method exploits the k-means clustering algorithm, we performed the
experiments with different values for k₁ (i.e., the number of clusters for the MM
organism) and k₂ (i.e., the number of clusters for the HSS organism), in order to
evaluate the possible effect of such parameters on the results. In particular, we
considered the following parameter values: k₁ ∈ {2, 3}, k₂ ∈ {2, 3}. We remind the reader
that we work in the PU learning setting (i.e., the dataset does not contain
any negative example). Therefore, inspired by [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], we evaluated the results in
terms of the average recall@k, in a 10-fold CV setting. In particular, in order
to quantitatively compare the obtained results, we draw the recall@k curve, by
varying the value of k (i.e., the number of top-ranked interactions to consider as
positive), and compute the area under the curve.
        </p>
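The recall@k evaluation can be sketched as follows (toy ranking and held-out positives; normalizing the area to [0, 1] is our assumption, not necessarily the paper's exact convention):

```python
# Sketch of the recall@k curve and its (normalized) area, on toy data.

def recall_at_k(ranking, positives, k):
    """Fraction of held-out positives among the top-k ranked links."""
    return len(set(ranking[:k]) & positives) / len(positives)

def area_under_recall_curve(ranking, positives):
    """Average recall@k over all k = 1..len(ranking), i.e. the area
    under the recall@k curve normalized to [0, 1]."""
    recalls = [recall_at_k(ranking, positives, k)
               for k in range(1, len(ranking) + 1)]
    return sum(recalls) / len(recalls)

ranking = ["a", "b", "c", "d"]     # predicted links, best first
positives = {"a", "c"}             # held-out true interactions
print(recall_at_k(ranking, positives, 2))          # 0.5
print(area_under_recall_curve(ranking, positives)) # 0.75
```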
        <p>We compared our method, indicated as transfer, with two approaches:
- no transfer, which corresponds to the WSVM with Platt scaling learned only
from the target network (i.e., from the HSS network). This baseline allows us to
evaluate the contribution of the source domain.
- union, which is the WSVM with Platt scaling learned from a single dataset
consisting of the union of the instances coming from both MM and HSS. This
baseline allows us to evaluate the effect of our weighting strategy.
Since we are interested in observing the contribution provided by our approach
with respect to the non-transfer approach, results will be evaluated in terms of
improvement with respect to the no transfer baseline.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>In Figure 4, we show the results obtained with different values of k₁ and k₂. In
particular, we considered different percentages of the recall@k curve and
measured the area under each sub-curve. This evaluation is motivated by the fact
that biologists usually focus their in-lab studies on the analysis of the top-ranked
predicted interactions. Therefore, a better result in the first part of the recall@k
curve (i.e., at 1%, 2%) appears to be more relevant for the real biological
application. The graphs show that both the union baseline and the proposed method
(transfer) are able to obtain better results with respect to the variant
without any transfer of knowledge (no transfer). This confirms that, in this case,
the external source of knowledge (the MM gene network) can be exploited to
improve the reconstruction of the target network (the HSS gene network).</p>
        <p>By comparing the union baseline with our method, we can observe that
the proposed weighting strategy was effective in assigning the right contribution
to each unlabeled instance (coming either from the source or from the target
network) in the learning phase. This is even more evident in the first part of the
recall@k curve, where our method was able to retrieve about 120 additional true
interactions at the top 1% of the ranking with respect to the baseline approaches.</p>
        <p>
          Finally, by analyzing the results with respect to the values of k₁ and k₂,
we can conclude that the highest improvement over the baseline approaches
has been obtained with k₁ = 2 and k₂ = 2. This means that clustering can
affect the results, and that even higher improvements could be obtained by
adopting smarter clustering strategies that can, for example, catch and exploit
the distribution, in terms of density, of the examples in the feature space.
        </p>
        <p>
          We proposed a transfer learning method to solve the link prediction task in the
PU learning setting. By resorting to a clustering-based strategy, our method is
able to exploit unlabeled data, as well as labeled and unlabeled data of a different,
related domain, identifying a different weight for each training instance. Focusing
on biological networks, we evaluated the performance of the proposed method
in the reconstruction of the human gene network, supported by the mouse gene
network. Results show that our method is able to improve the accuracy of the
reconstruction with respect to baseline approaches. As future work, we plan
to implement a distributed version of the proposed method and to adopt
ensemble-based approaches [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ] to exploit multiple clusters in the prediction.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>We would like to acknowledge the European project MAESTRA - Learning from
Massive, Incompletely annotated, and Structured Data (ICT-2013-612944).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. J. C. Platt.
          <article-title>Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods</article-title>
          .
          <source>In Advances in large margin classifiers</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kuzmanovski</surname>
          </string-name>
          , and S. Dzeroski.
          <article-title>Semi-supervised multi-view learning for gene network reconstruction</article-title>
          .
          <source>PLOS ONE</source>
          ,
          <volume>10</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          , 12
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xue</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Boosting for transfer learning</article-title>
          .
          <source>In Proc. of ICML</source>
          , pages
          <fpage>193</fpage>
          -
          <lpage>200</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>C.</given-names>
            <surname>Elkan</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Noto</surname>
          </string-name>
          .
          <article-title>Learning classifiers from only positive and unlabeled data</article-title>
          .
          <source>In Proc. of ACM SIGKDD</source>
          , pages
          <fpage>213</fpage>
          -
          <lpage>220</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J.</given-names>
            <surname>Levatic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kocev</surname>
          </string-name>
          , and S. Dzeroski.
          <article-title>Self-training for multi-target regression with tree ensembles</article-title>
          .
          <source>Knowl.-Based Syst.</source>
          ,
          <volume>123</volume>
          :
          <fpage>41</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J.</given-names>
            <surname>Levatic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kocev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          , and S. Dzeroski.
          <article-title>Semi-supervised trees for multitarget regression</article-title>
          .
          <source>Inf. Sci.</source>
          ,
          <volume>450</volume>
          :
          <fpage>109</fpage>
          -
          <lpage>127</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>D.</given-names>
            <surname>Marbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Costello</surname>
          </string-name>
          , et al.
          <article-title>Wisdom of crowds for robust gene network inference</article-title>
          .
          <source>Nature Methods</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>P.</given-names>
            <surname>Mignone</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Pio</surname>
          </string-name>
          .
          <article-title>Positive unlabeled link prediction via transfer learning for gene network reconstruction</article-title>
          .
          <source>In ISMIS 2018</source>
          , pages
          <fpage>13</fpage>
          -
          <lpage>23</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. W.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Hu</surname>
          </string-name>
          .
          <article-title>Transfer learning for WiFi-based indoor localization</article-title>
          .
          <source>Workshop on Transfer Learning for Complex Task AAAI</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>G.</given-names>
            <surname>Pio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>D'Elia</surname>
          </string-name>
          .
          <article-title>ComiRNet: a web-based system for the analysis of miRNA-gene regulatory networks</article-title>
          .
          <source>BMC Bioinform</source>
          ,
          <volume>16</volume>
          (S-9):
          <fpage>S7</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>G.</given-names>
            <surname>Pio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>D'Elia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          .
          <article-title>Integrating microRNA target predictions for the discovery of gene regulatory networks: a semi-supervised ensemble learning approach</article-title>
          .
          <source>BMC Bioinform</source>
          ,
          <volume>15</volume>
          (S-1):
          <fpage>S4</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>G.</given-names>
            <surname>Pio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Serafino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          .
          <article-title>Multi-type clustering and classification from heterogeneous networks</article-title>
          .
          <source>Inf. Sci.</source>
          ,
          <volume>425</volume>
          :
          <fpage>107</fpage>
          -
          <lpage>126</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>F.</given-names>
            <surname>Serafino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          .
          <article-title>Ensemble learning for multi-type classification in heterogeneous networks</article-title>
          .
          <source>IEEE Trans. Knowl. Data Eng.</source>
          ,
          <volume>30</volume>
          (
          <issue>12</issue>
          ):
          <fpage>2326</fpage>
          -
          <lpage>2339</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Song</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>A weighted support vector machine for data classification</article-title>
          .
          <source>Int. J. Pattern Recogn.</source>
          , Vol.
          <volume>21</volume>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>