Positive Unlabeled Link Prediction via Transfer Learning for Gene Network Reconstruction (Discussion Paper)

Paolo Mignone, Gianvito Pio and Michelangelo Ceci
paolo.mignone@uniba.it, gianvito.pio@uniba.it, michelangelo.ceci@uniba.it
Department of Computer Science - University of Bari Aldo Moro
Via Orabona, 4 - 70125 Bari (Italy)

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.

Abstract. Transfer learning can be employed to leverage knowledge from a source domain in order to better solve tasks in a target domain, where the available data are exiguous. While most of the previous papers work in the supervised setting, we study the more challenging case of positive-unlabeled transfer learning, where few positive labeled instances are available for both the source and the target domains. Specifically, we focus on the link prediction task on network data, where we consider known existing links as positive labeled data and all the possible remaining links as unlabeled data. The transfer learning method described in this paper exploits the unlabeled data and the knowledge of a source network in order to improve the reconstruction of a target network. Experiments, conducted in the biological field, showed the effectiveness of the proposed approach with respect to the considered baselines, when exploiting the Mus Musculus gene network (source) to improve the reconstruction of the Homo Sapiens Sapiens gene network (target).

1 Introduction

The link prediction task aims at estimating the probability of the existence of an interaction between two entities on the basis of the available set of known interactions, which belong to the same data distribution and the same context, and are described according to the same features. However, in many real cases, the identical data distribution assumption does not hold, especially if data are organized in heterogeneous information networks [12, 13]. For example, in the study of biological networks, collecting training data is very expensive, and it is often necessary to build link prediction models on the basis of data regarding different (even if related) contexts. In this regard, transfer learning strategies can be adopted to leverage knowledge from a source domain to improve the performance on a task solved in a target domain, for which few labeled data are available (see Figure 1).

Fig. 1. Knowledge on a source task exploited to solve the target task.

In the literature, we can find several applications where transfer learning approaches have proved to be beneficial. For example, in the classification of Web documents, transfer learning approaches can be exploited to classify newly created Web sites which follow a different data distribution [3] (e.g., when the content is related to new subtopics). Another example is the work proposed in [9], where the authors exploit transfer learning approaches to adapt a WiFi localization model trained in one time period (source domain) to a new time period (target domain), where the available data possibly follow a different data distribution. Focusing on the link prediction task, in the literature we can find several works in the biological field, since biological entities and their relationships can be naturally represented as a network.
In the specific field of genomics, the working mechanisms of several organisms are usually modeled through gene interaction networks, where nodes represent genes and edges represent regulation activities. Since gene expression data are easy to obtain, several methods have been proposed in the literature that exploit this kind of data [7]. These approaches analyze the expression level of the genes under different conditions. Therefore, most machine learning approaches aiming to solve the link prediction task generally analyze gene expression data, with the final goal of reconstructing the whole network structure (Gene Network Reconstruction - GNR).

While existing methods generally work effectively on sufficiently large training data, a transfer learning approach could favor the GNR of specific organisms which are not well studied, by exploiting the knowledge acquired about different, related organisms. The main contribution of our work [8] is to evaluate the possible benefits that transfer learning techniques can provide to the task of link prediction. In particular, we exploit the available information about a source network for the reconstruction of a target network with poor available data. Moreover, we study the more challenging setting, i.e., the Positive-Unlabeled (PU) setting, where few positive labeled examples are available for both domains, and no negative example is available [2, 11]. The PU learning setting holds in many real contexts (e.g., text categorization, bioinformatics) where it is very expensive or unfeasible to obtain negative examples for the studied concept.

In order to evaluate the performance of the proposed method, we performed experiments in the biological domain. In particular, our experiments focused on the reconstruction of the human (Homo Sapiens Sapiens) gene network guided by the gene network of another, related organism, i.e., the mouse (Mus Musculus).

2 The proposed method

In this section, we describe our transfer learning approach to solve link prediction tasks on network data. Before describing it in detail, we introduce some useful notions and formally define the link prediction task for a single domain. Let:

– $V$ be the set of nodes of the network;
– $x = \langle v', v'' \rangle \in (V \times V)$ be a (possible) link between $v'$ and $v''$, where $v' \neq v''$;
– $e(v) = [e_1(v), e_2(v), \ldots, e_n(v)]$ be the vector of features related to the node $v$, where $e_i(v) \in \mathbb{R}$, $\forall i \in \{1, 2, \ldots, n\}$;
– $e(x) = [e_1(v'), e_2(v'), \ldots, e_n(v'), e_1(v''), e_2(v''), \ldots, e_n(v'')]$ be the vector of features related to the link $x = \langle v', v'' \rangle$;
– $sim(a, b) \in [0, 1]$ be a similarity function between the vectors $a$ and $b$;
– $l(x)$ be a function that returns 1 if the link $x$ is a known existing link, and 0 if its existence is unknown;
– $L = \{x \mid x \in (V \times V) \wedge l(x) = 1\}$ be the set of labeled links;
– $U = (V \times V) \setminus L$ be the set of unlabeled links;
– $D = \{\tilde{X}, P(X)\}$ be the domain described by the feature space $\tilde{X} = \mathbb{R}^{2n}$, with a specific marginal data distribution $P(X)$, where $X = L \cup U$;
– $w(x)$ ($0 \leq w(x) \leq 1$) be a computed weight for the link $x \in U$, which estimates the degree of certainty of its existence;
– $f(x)$ be an ideal function which returns 1 if $x$ exists, and 0 otherwise.

The task we intend to solve is then defined as follows:

Given: a set of training examples $\{\langle e(x), w(x) \rangle\}_x$, each of which is described by a feature vector and a weight.
Find: a function $f' : \mathbb{R}^{2n} \rightarrow [0, 1]$ which takes as input a vector of features $e(x)$ and returns the probability that $x$ exists. Thus, $f'(e(x)) \approx P(f(x) = 1)$ or, in other terms, $f'$ approximates the probability distribution over the values of $f$.
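To make the notation concrete, the following minimal Python sketch (our own illustrative code, not part of the original method description) builds the feature vector $e(x)$ of a candidate link by concatenating the two node vectors, and implements a similarity of the kind instantiated later in this section, i.e., 1 minus the Euclidean distance between min-max normalized vectors; the clipping to [0, 1] is our assumption, added only to match the declared range of $sim$.

```python
import numpy as np

def link_features(e_v1: np.ndarray, e_v2: np.ndarray) -> np.ndarray:
    """e(x): concatenation of the n-dimensional feature vectors of the two nodes,
    yielding a 2n-dimensional representation of the candidate link x = <v', v''>."""
    return np.concatenate([e_v1, e_v2])

def sim(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two feature vectors (assumed min-max normalized):
    1 minus their Euclidean distance, clipped to [0, 1] (clipping is our assumption)."""
    return float(np.clip(1.0 - np.linalg.norm(a - b), 0.0, 1.0))

# Example: two genes, each described by n = 3 (normalized) expression features.
x = link_features(np.array([0.2, 0.8, 0.1]), np.array([0.3, 0.7, 0.2]))  # e(x), 2n = 6
```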
Our method works with two different domains: the source domain $D_s = \{\tilde{X}_s, P(X_s)\}$ and the target domain $D_t = \{\tilde{X}_t, P(X_t)\}$, which are described according to the same feature space, i.e., $\tilde{X}_s = \tilde{X}_t$, while the marginal data distributions are generally different, i.e., $P(X_s) \neq P(X_t)$. Given the two sets of labeled examples $L_s$ and $L_t$, regarding the source and the target domain respectively, the method consists of three stages, which are summarized in Figure 2 and detailed in the following subsections.

Fig. 2. A graphical overview of the proposed transfer learning approach.

Stage I - Clustering. The first stage of our method consists in the identification of a clustering model for the positive examples of each domain (i.e., on $L_s$ and $L_t$). The application of a clustering method is motivated by the necessity to distinguish among possible multiple viewpoints of the underlying concept of positive interactions. Moreover, a summarization in terms of clusters' centroids becomes useful also from a computational viewpoint, since in the subsequent stages we can compare centroids instead of single instances. We adopt the classical k-means algorithm, since it is well established in the literature. However, any other prototype-based clustering algorithm, possibly able to catch specific peculiarities of the data at hand, could be plugged into our method.

Stage II - Instance Weighting. Although an unlabeled link could be either a positive or a negative example, we consider all the unlabeled examples as positive examples and compute a weight representing the degree of certainty in [0, 1] of being a positive example: a value close to 0 (respectively, to 1) means that the example is likely to be a negative (respectively, positive) example. The weights associated with the unlabeled instances of both the source and the target domains are computed according to their similarities with respect to the centroids obtained in the first stage. In particular, we identify a different weighting function for the source and the target domains, in order to smooth the contribution provided by instances coming from the source domain. An overview of the weighting strategy is graphically depicted in Figure 3.

Fig. 3. Weighting process. The instance A is weighted according to the similarity to its most similar centroid (4A), while the instance B is weighted according to the similarity to its most similar centroid (1B), and to the similarity between 1B and 2A.

Specifically, an unlabeled link $x$ belonging to the target network (i.e., $x \in V_t \times V_t$) is weighted according to its similarity with respect to the centroid of its closest cluster, among the clusters identified from the target network. Formally:

$$w(x) = \max_{c_t \in C_t} sim(e(x), c_t), \qquad (1)$$

where $C_t$ is the set of centroids of the clusters identified from the positive examples of the target network.
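A minimal sketch of Stage I and of the target-side part of Stage II follows, under the assumption that scikit-learn's k-means is an acceptable stand-in for the clustering step; all names are illustrative, and the similarity is the vectorized form of the sim function sketched earlier.

```python
import numpy as np
from sklearn.cluster import KMeans

def target_weights(L_t: np.ndarray, U_t: np.ndarray, k2: int) -> np.ndarray:
    """Stage I: cluster the positive target links L_t (rows are e(x) vectors).
    Stage II (target side): weight each unlabeled link in U_t by its similarity
    to the closest centroid, i.e., w(x) = max_{c_t in C_t} sim(e(x), c_t)  (Eq. 1)."""
    C_t = KMeans(n_clusters=k2, n_init=10).fit(L_t).cluster_centers_
    # Pairwise Euclidean distances between unlabeled links and the k2 centroids.
    dists = np.linalg.norm(U_t[:, None, :] - C_t[None, :, :], axis=2)
    # sim = 1 - distance, so the max over centroids is the min over distances.
    return np.clip(1.0 - dists.min(axis=1), 0.0, 1.0)
```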
On the other hand, an unlabeled link $x'$ belonging to the source network (i.e., $x' \in V_s \times V_s$) is weighted by considering two similarity values: i) the similarity with respect to the centroid of its closest cluster, computed among the clusters identified from the source network, and ii) the similarity between such a centroid and the closest centroid identified on the target network. Formally, let $c' = \arg\max_{c_s \in C_s} sim(e(x'), c_s)$ be the closest centroid with respect to $x'$ among the centroids $C_s$ identified in the source network. Then:

$$w(x') = sim(e(x'), c') \cdot \max_{c'' \in C_t} sim(c', c''). \qquad (2)$$

As a similarity function, we exploit the Euclidean distance, after applying a min-max normalization (in the range [0, 1]) to all the features of the feature vectors. Formally:

$$sim(e(x'), e(x'')) = 1 - \sqrt{\sum_{k=1}^{2n} (e_k(x') - e_k(x''))^2}.$$

Stage III - Training the classifier. In the third stage, we train a probabilistic classifier, based on linear Weighted Support Vector Machines (WSVM) [14] with Platt scaling [1], from the weighted unlabeled instances coming from both the source and the target networks. We selected an SVM-based classifier mainly because i) it has a (relatively) good computational efficiency, especially in the prediction phase, and ii) it already proved to be effective (with Platt scaling) in the semi-supervised setting [4]. At the end of the training phase, the WSVM classifier produces a model in the form of a hyperplane function $h$. Then, by exploiting the Platt scaling, for each unlabeled link $x$, we compute the probability of being a positive example as $f'(e(x)) = \frac{1}{1 + e^{-h(e(x))}}$, where $h(e(x))$ is the score obtained by the learned WSVM. Finally, we rank all the predicted links in descending order with respect to their probability of being positive.
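The sketch below covers the source-side weighting of Equation 2 and the final training stage. Two caveats: scikit-learn's SVC with per-example sample_weight and probability=True is used as a stand-in for the WSVM with Platt scaling; and, since an SVM needs two classes while the training set described above contains only weighted positives, we duplicate each unlabeled example as a positive with weight $w(x)$ and a negative with weight $1 - w(x)$, an Elkan-Noto-style PU recipe [4] that the paper does not spell out. Everything here is our own illustrative reading, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC

def source_weights(U_s: np.ndarray, C_s: np.ndarray, C_t: np.ndarray) -> np.ndarray:
    """Eq. 2: w(x') = sim(e(x'), c') * max_{c'' in C_t} sim(c', c''),
    with c' the closest source centroid to x'."""
    d_s = np.linalg.norm(U_s[:, None, :] - C_s[None, :, :], axis=2)
    closest = d_s.argmin(axis=1)                                    # index of c'
    part = np.clip(1.0 - d_s.min(axis=1), 0.0, 1.0)                 # sim(e(x'), c')
    d_ct = np.linalg.norm(C_s[:, None, :] - C_t[None, :, :], axis=2)
    cent_sim = np.clip(1.0 - d_ct.min(axis=1), 0.0, 1.0)            # per source centroid
    return part * cent_sim[closest]

def train_wsvm(X: np.ndarray, w: np.ndarray) -> SVC:
    """Stage III (illustrative): each weighted example appears once as positive
    (weight w) and once as negative (weight 1 - w); probability=True makes
    scikit-learn fit a Platt sigmoid on top of the linear SVM scores."""
    X2 = np.vstack([X, X])
    y2 = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
    w2 = np.concatenate([w, 1.0 - w])
    return SVC(kernel="linear", probability=True).fit(X2, y2, sample_weight=w2)

# Ranking: scores = model.predict_proba(U_t)[:, 1], sorted in descending order.
```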
A pseudo-code representation of the proposed method is shown in Algorithm 1.

Algorithm 1: PU Link Prediction via Transfer Learning

Data:
– $L_s = \{x_s \mid x_s \in (V_s \times V_s) \wedge l(x_s) = 1\}$: positive links of the source network
– $U_s = (V_s \times V_s) \setminus L_s$: unlabeled links of the source network
– $L_t = \{x_t \mid x_t \in (V_t \times V_t) \wedge l(x_t) = 1\}$: positive links of the target network
– $U_t = (V_t \times V_t) \setminus L_t$: unlabeled links of the target network
– $e(x)$: feature vector of the link $x$
– $sim(e(x'), e(x'')) = 1 - \sqrt{\sum_{k=1}^{2n} (e_k(x') - e_k(x''))^2}$
– $k_1$, $k_2$: number of positive clusters for the source and the target network, respectively
Result:
– ranked_links: predicted links ordered according to their likelihood

begin
  $C_s \leftarrow$ kmeans($L_s$, $k_1$); $C_t \leftarrow$ kmeans($L_t$, $k_2$);
  source_training_set $\leftarrow \emptyset$; target_training_set $\leftarrow \emptyset$; ranked_links $\leftarrow \emptyset$;
  foreach $x_t \in U_t$ do
    $w(x_t) \leftarrow \max_{c_t \in C_t} sim(e(x_t), c_t)$;
    target_training_set $\leftarrow$ target_training_set $\cup \{\langle e(x_t), w(x_t) \rangle\}$;
  foreach $x_s \in U_s$ do
    source_centroid $\leftarrow \arg\max_{c_s \in C_s} sim(e(x_s), c_s)$;
    partial_weight $\leftarrow sim(e(x_s),$ source_centroid$)$;
    centroid_sim $\leftarrow \max_{c_t \in C_t} sim($source_centroid$, c_t)$;
    $w(x_s) \leftarrow$ partial_weight $\cdot$ centroid_sim;
    source_training_set $\leftarrow$ source_training_set $\cup \{\langle e(x_s), w(x_s) \rangle\}$;
  training_set $\leftarrow$ source_training_set $\cup$ target_training_set;
  $h(\cdot) \leftarrow$ WSVM(training_set);
  $f'(\cdot) \leftarrow \frac{1}{1 + e^{-h(\cdot)}}$;
  foreach $x \in U_t$ do
    ranked_links $\leftarrow$ ranked_links $\cup \{\langle x, f'(e(x)) \rangle\}$;
  ranked_links $\leftarrow$ sort_by_score(ranked_links);
  return ranked_links;
end

3 Experiments

Our experiments have been performed in the biological field. In particular, the specific task we intend to solve is the reconstruction of the human (Homo Sapiens Sapiens - HSS) gene network. As a source task, we exploit the gene network of another, related organism, i.e., the mouse (Mus Musculus - MM).

3.1 Dataset

The considered dataset consists of gene expression data related to 6 organs (lung, liver, skin, brain, bone marrow, heart), obtained from the samples available at Gene Expression Omnibus (GEO), a public functional genomics repository. Overall, 161 and 174 samples were considered for MM and HSS, respectively, involving a total of 45,101 and 54,675 genes, respectively, each of which described by 6 features (one for each organ). Accordingly, the interactions among genes were described by 12-dimensional feature vectors, obtained by concatenating the feature vectors associated with the involved genes. The set of validated gene interactions (verified positive examples) was extracted from the public repository BioGRID (https://thebiogrid.org).

3.2 Experimental Setting

Since our method exploits the k-means clustering algorithm, we performed the experiments with different values for $k_1$ (i.e., the number of clusters for the MM organism) and $k_2$ (i.e., the number of clusters for the HSS organism), in order to evaluate the possible effect of such parameters on the results. In particular, we considered the following parameter values: $k_1 \in \{2, 3\}$, $k_2 \in \{2, 3\}$.

We remind the reader that we work in the PU learning setting (i.e., the dataset does not contain any negative example). Therefore, inspired by [10], we evaluated the results in terms of the average recall@k, in a 10-fold CV setting. In particular, in order to quantitatively compare the obtained results, we draw the recall@k curve, by varying the value of k (i.e., the number of top-ranked interactions to consider as positive), and compute the area under the curve.

We compared our method, indicated as transfer, with two approaches:
- no transfer, which corresponds to the WSVM with Platt scaling learned only from the target network (i.e., from the HSS network). This baseline allows us to evaluate the contribution of the source domain.
- union, which is the WSVM with Platt scaling learned from a single dataset consisting of the union of the instances coming from both MM and HSS. This baseline allows us to evaluate the effect of our weighting strategy.

Since we are interested in observing the contribution provided by our approach with respect to the non-transfer approach, results will be evaluated in terms of improvement with respect to the no transfer baseline.
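As an illustration of this evaluation protocol (our own sketch, not the authors' code), the recall@k curve can be computed from the ranked predictions and the held-out positives as follows; the area under any prefix of the curve (e.g., the top 1%) can then be approximated numerically.

```python
import numpy as np

def recall_at_k_curve(scores: np.ndarray, is_positive: np.ndarray) -> np.ndarray:
    """recall@k for k = 1..N: fraction of held-out positive links retrieved
    among the top-k predictions when links are ranked by decreasing score."""
    order = np.argsort(-scores)              # rank links by predicted probability
    hits = np.cumsum(is_positive[order])     # positives retrieved up to each rank k
    return hits / max(1, int(is_positive.sum()))

# Area under the first 1% of the curve (normalized by its length):
# curve = recall_at_k_curve(scores, y_true)
# k_max = max(1, len(curve) // 100)
# auc_1pct = np.trapz(curve[:k_max]) / k_max
```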
3.3 Results

In Figure 4, we show the results obtained with different values of $k_1$ and $k_2$.

Fig. 4. Improvement over no transfer, with different values of $k_1$ and $k_2$.

In particular, we considered different percentages of the recall@k curve and measured the area under each sub-curve. This evaluation is motivated by the fact that biologists usually focus their in-lab studies on the analysis of the top-ranked predicted interactions. Therefore, a better result in the first part of the recall@k curve (i.e., at 1%, 2%) appears to be more relevant for the real biological application. The graphs show that both the union baseline and the proposed method (transfer) are able to obtain a better result with respect to the variant without any transfer of knowledge (no transfer). This confirms that, in this case, the external source of knowledge (the MM gene network) can be exploited to improve the reconstruction of the target network (the HSS gene network).

By comparing the union baseline with our method, we can observe that the proposed weighting strategy was effective in assigning the right contribution to each unlabeled instance (coming either from the source or from the target network) in the learning phase. This is even more evident in the first part of the recall@k curve, where our method was able to retrieve about 120 additional true interactions in the top 1% of the ranking with respect to the baseline approaches.

Finally, by analyzing the results with respect to the values of $k_1$ and $k_2$, we can conclude that the highest improvement over the baseline approaches was obtained with $k_1 = 2$ and $k_2 = 2$. This means that clustering can affect the results, and that even higher improvements could be obtained by adopting smarter clustering strategies that can, for example, catch and exploit the distribution, in terms of density, of the examples in the feature space.

4 Conclusion and Future Work

We proposed a transfer learning method to solve the link prediction task in the PU learning setting. By resorting to a clustering-based strategy, our method is able to exploit unlabeled data of the target domain, as well as labeled and unlabeled data of a different, related domain, identifying a different weight for each training instance. Focusing on biological networks, we evaluated the performance of the proposed method in the reconstruction of the human gene network, supported by the mouse gene network. Results show that our method is able to improve the accuracy of the reconstruction with respect to baseline approaches. As future work, we plan to implement a distributed version of the proposed method and to adopt ensemble-based approaches [5, 6] to exploit multiple clusters in the prediction.

Acknowledgments

We would like to acknowledge the European project MAESTRA - Learning from Massive, Incompletely annotated, and Structured Data (ICT-2013-612944).

References

1. J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, 1999.
2. M. Ceci, G. Pio, V. Kuzmanovski, and S. Džeroski. Semi-supervised multi-view learning for gene network reconstruction. PLOS ONE, 10(12):1–27, 2015.
3. W. Dai, Q. Yang, G. Xue, and Y. Yu. Boosting for transfer learning. In Proc. of ICML, pages 193–200, 2007.
4. C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proc. of ACM SIGKDD, pages 213–220, 2008.
5. J. Levatić, M. Ceci, D. Kocev, and S. Džeroski. Self-training for multi-target regression with tree ensembles. Knowl.-Based Syst., 123:41–60, 2017.
6. J. Levatić, D. Kocev, M. Ceci, and S. Džeroski. Semi-supervised trees for multi-target regression. Inf. Sci., 450:109–127, 2018.
7. D. Marbach, J. C. Costello, et al. Wisdom of crowds for robust gene network inference. Nature Methods, 9(8):796–804, 2012.
8. P. Mignone and G. Pio. Positive unlabeled link prediction via transfer learning for gene network reconstruction. In ISMIS 2018, pages 13–23, 2018.
9. S. J. Pan, V. W. Zheng, Q. Yang, and D. H. Hu. Transfer learning for WiFi-based indoor localization. In AAAI Workshop on Transfer Learning for Complex Tasks, 2008.
10. G. Pio, M. Ceci, D. Malerba, and D. D'Elia. ComiRNet: a web-based system for the analysis of miRNA-gene regulatory networks. BMC Bioinformatics, 16(S-9):S7, 2015.
11. G. Pio, D. Malerba, D. D'Elia, and M. Ceci. Integrating microRNA target predictions for the discovery of gene regulatory networks: a semi-supervised ensemble learning approach. BMC Bioinformatics, 15(S-1):S4, 2014.
12. G. Pio, F. Serafino, D. Malerba, and M. Ceci. Multi-type clustering and classification from heterogeneous networks. Inf. Sci., 425:107–126, 2018.
13. F. Serafino, G. Pio, and M. Ceci. Ensemble learning for multi-type classification in heterogeneous networks. IEEE Trans. Knowl. Data Eng., 30(12):2326–2339, 2018.
14. X. Yang, Q. Song, and Y. Wang. A weighted support vector machine for data classification. Int. J. Pattern Recognit. Artif. Intell., 21(5), 2007.