                                                                  1st Workshop on AI + Informetrics - AII2021




                  Research on Heterogeneous Enhanced Network
                     Embedding for Collaboration Prediction

       Xin Zhang1[0000-0001-8784-3788], Yi Wen1[0000-0002-6520-2733] and Haiyun Xu2[0000-0002-7453-3331]
              1 Chengdu Library and Information Center, CAS, Chengdu 610041, Sichuan, China
              2 Business School, Shandong University of Technology, Zibo 255049, Shandong, China
                                       1 zhangxin@clas.ac.cn



                  Abstract. Scientific research collaboration has long been an important topic
                  in information science, and collaboration prediction is a key issue in
                  personalized information services. This article constructs an author-centered
                  heterogeneous information fusion schema, uses causal analysis to
                  quantitatively study the influence of shared institutions, co-words and
                  citations on collaboration, compares the effects of different network
                  embedding algorithms in collaboration prediction, and builds a heterogeneous
                  information fused network embedding model for collaboration prediction.
                  Taking the field of stem cells as an empirical case, experiments show that
                  matrix factorization based network embedding algorithms (such as NetMF)
                  offer a good balance of efficiency and accuracy. Institutions, keywords and
                  citations can all improve collaboration prediction, in the order same
                  institution > citation > co-word. Models fusing multiple features are
                  generally better than single-feature fusion, and the model combining
                  collaboration, same institution and citation performs outstandingly in
                  collaboration prediction.

                  Keywords: Collaboration prediction, Network embedding, Heterogeneous
                  information.


      1           Introduction

     Since the birth of information science, scientific research collaboration has
been an important research topic. Collaboration prediction has important
theoretical and practical significance for the analysis of S&T trends and for the
recommendation of personalized information services, and it is also a very
challenging task. Scholars in different fields have invested in this topic:
computer scientists design ever more advanced and complex network representation
learning algorithms and find ways to incorporate different types of information,
such as text, into representation learning, while information science researchers
pay more attention to applying these algorithms to collaboration recommendation.

     Newman [1] was the first to introduce network analysis into the study of
research collaboration. Liben-Nowell and Kleinberg [2] formulated the link
prediction problem in social networks, proposed several similarity measurements
based on network structure, applied two families of indices based on nodes and
paths, and carried out an empirical analysis on author collaboration networks in
five fields of physics.



Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).

     Since then, quite a lot of work has focused on improving the indicators in
that article. In the book "Link Prediction", Lü Linyuan introduced a series of
similarity-based link prediction indicators, such as the common neighbor (CN)
index, cosine similarity, the Jaccard index, the Adamic-Adar (AA) index, the
resource allocation (RA) index and the LHN-I index based on local node
information, as well as the local path (LP) index, the Katz index, random walk
with restart and other indicators based on edge or path information [3].
Information science researchers have conducted a series of empirical studies based
on these indicators; for example, Guns and Rousseau [4] constructed weighted
collaboration networks in the fields of malaria and tuberculosis to carry out
collaboration prediction and recommendation, Yan et al. [5] used the papers of 59
journals in library and information science as an empirical case to compare the
prediction results under various indicators, and Shan Songyan et al. [6]
summarized and reviewed author similarity algorithms for collaboration prediction.
In addition, collaboration prediction also has methods based on maximum likelihood
estimation and on probabilistic graphical models. These two approaches are based
on statistical ideas and have achieved certain effects on specific networks, but
their computational complexity is too high to apply to large-scale networks.

      In response to the large-scale sparse network prediction problems encountered in
actual research, researchers continue to propose new methods that learn a
low-dimensional dense representation of the network and then use this
representation for structural prediction tasks. For example, Zhang Jinzhu et al.
[7] introduced network representation learning into collaboration prediction:
the LINE model is used to construct vector representations of authors, and cosine
similarity is used to measure the possibility of collaboration. Yu Chuanming et
al. [8] studied the application of network representation learning methods in
collaboration recommendation, proposed an integrated recommendation model, and
conducted an empirical analysis in the financial field.

     Moreover, it must be noted that collaboration is affected by many complicated
factors. In addition to the network structure, collaboration prediction should
also make use of rich heterogeneous information. Wang Zhibing et al. [9] merged
attribute characteristics such as the node's institution into the similarity
indices of the network structure and carried out collaboration prediction. Liu
Ping et al. [10] constructed an LDA-based author interest model on the basis of
community division and then analyzed the authors' relevant literature for the
purpose of scientific research collaboration recommendation. Yu Chuanming et al.
[11] constructed a collaboration network in the financial field and adopted a
link prediction method based on the fusion of individual, institutional and
regional features. Lin Yuan et al. [12] combined the heterogeneous information of
scholars, institutions, and keywords to construct a scientific research
collaboration network, and then used the node2vec representation learning method
for collaboration prediction. Zhang Xin et al. [13] proposed a scientific research
collaboration prediction method that combines network representation learning and
authors' topic characteristics. On the basis of these studies, this paper
constructs a collaboration prediction method based on feature fusion.


2      Ideas and Methods

2.1    Research Framework




                       Fig. 1.   Research Framework of this paper

      The research framework of this paper is shown in Figure 1. It can be roughly
divided into six stages: (1) data collection; (2) exploratory data analysis and
network construction; (3) analysis of the impact of heterogeneous information on
scientific research collaboration; (4) evaluation of embedding algorithms for link
prediction; (5) construction of heterogeneous-feature-fused network embedding
approaches; and (6) results and analysis. We retrieve the documents to be analyzed
from the Web of Science database, conduct an exploratory analysis of the
collaboration relationships among them, and construct a collaboration network, an
author-institution network, an author-keyword network and an author-citation
network. Then, for the same-institution, shared-keyword and citation
relationships, a causal analysis framework is used to study their impact on
collaboration. Next, we compare the performance and efficiency of DeepWalk, LINE,
HOPE, Node2vec, SDNE, NetMF, ProNE and other models for collaboration prediction.
Finally, we integrate the Institution (I), Keyword (K) and Citation (C) features
into network embedding models and use these schemas to carry out the actual
prediction task.

2.2    Research on Heterogeneous Information and Its Impact on Collaboration

     Heterogeneous information fusion, abbreviated as data fusion, refers to the
comprehensive analysis of different types of information sources or relational
data through a specific method, so that all the information can jointly reveal the
characteristics of the research object. It makes up for the insufficiency of a
single data type or a single relation type in revealing the associations between
entities in a research field, and thus yields more comprehensive and objective
measurement results (Xu Haiyun et al. [14]).
      Hua Bolin [15] put forward fusion theory as one of the main methodologies of
information science and emphasized the importance of data fusion, information
fusion and knowledge fusion in information science. He later further discussed the
influence of data fusion on intelligence work and the importance of fusing
multi-source information of different types, and systematically explained the
relevant theories and applications of information fusion from the perspectives of
its representation process, technical algorithms and models. Morris et al. [16]
gave an overview of common measurement entities in scientific and technological
literature, mainly including the documents themselves, references, the journals in
which documents are published, the authors of documents, the journals in which
references are published, the authors of references, subject headings, etc. Xu
Haiyun et al. [14] reviewed multi-source data fusion methods in scientometrics;
the document-centered data fusion meta-path model given in that paper is shown in
Figure 2(a). This article takes author collaboration as the research object and
studies the influence of the same institution, the same keywords, and citation
relationships on scientific research collaboration. On the basis of Figure 2(a),
an author-centered scientific research collaboration schema is constructed, as
shown in Figure 2(b).




       (a) Document-centered schema                 (b) Author-centered schema
               Fig. 2. Heterogeneous Information Fusion Schema

             Table. 1. Heterogeneous Information Network Construction
Information     Networks directly extracted from documents     Uniform Network
Collaboration   Collaboration Network                          Collaboration Network
Institution     Author-Institution Bipartite Network           Same Institution Network
Keyword         Author-Document Bipartite Network,             Author Co-word Network
                Document-Keyword Bipartite Network
Citation        Author-Document Bipartite Network,             Author Citation Network
                Document Citation Network
      Table 1 shows how the heterogeneous information networks are constructed. The
first column lists the fused heterogeneous information, the second column lists
the feature networks that can be extracted directly from the articles, and the
third column lists the networks formed by mapping the second type of network onto
the author dimension.
      (1) Author collaboration network. The author collaboration network can be
extracted directly from the author field of the article. The nodes in the network
represent authors, the edges in the network represent collaboration relationships, and
the weight of the edges represents the frequency of collaboration.
      (2) Same institution network. The author-institution bipartite network is
extracted from the author address (C1) field of the articles. An institution can
contain multiple authors, and an author may be affiliated with multiple
institutions. This bipartite network is mapped into a same-institution network
among authors: the nodes represent authors, and two authors are connected if they
share an affiliated institution.
      (3) Author co-word network. From the downloaded documents we can construct an
article-author network and an article-keyword network, and then form an
author-keyword bipartite network. The weight d(a,k) in this network denotes the
number of articles in which author a uses keyword k, and Γ(a) denotes the neighbor
nodes of node a, that is, the keywords used by a. From this network we can derive
the author co-word network, whose nodes represent authors. The edge weight w(a,b)
between authors a and b is calculated by formula (1), where D is the total number
of documents and d(k) is the frequency of keyword k.

        w(a,b) = Σ_{k ∈ Γ(a)∩Γ(b)} min(d(a,k), d(b,k)) · log(D / d(k))            (1)

      (4) Author citation network. From the downloaded documents an author-article
bipartite network can be constructed (the set of articles written by author a is
denoted Γ(a)), together with an article citation network; from these, an author
citation network is formed. The weight c(a,b) in this network is the number of
citations from the articles of author a to the articles of author b:

        c(a,b) = Σ_{d ∈ Γ(a)} Σ_{c ∈ Γ(b)} δ(d,c)                                 (2)

where δ(d,c) = 1 if document d cites document c and δ(d,c) = 0 otherwise. Unlike
the previous two networks, the author citation network is a directed network.
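
To make formulas (1) and (2) concrete, the following Python sketch derives the
author co-word weights and author citation counts from simple dictionary
representations of the underlying bipartite networks. The input names
(doc_authors, doc_keywords, author_docs, doc_citations) and the helper functions
are illustrative assumptions, not code from the paper.

    import math
    from collections import defaultdict

    def coword_weights(doc_authors, doc_keywords):
        """Sketch of formula (1). doc_authors: doc -> set of authors,
        doc_keywords: doc -> set of keywords."""
        D = len(doc_keywords)                         # total number of documents
        d_ak = defaultdict(lambda: defaultdict(int))  # d(a,k): #articles of a using k
        d_k = defaultdict(int)                        # d(k): #articles using keyword k
        for doc, kws in doc_keywords.items():
            for k in kws:
                d_k[k] += 1
                for a in doc_authors.get(doc, ()):
                    d_ak[a][k] += 1
        w = defaultdict(float)
        authors = list(d_ak)
        for i, a in enumerate(authors):
            for b in authors[i + 1:]:
                for k in set(d_ak[a]) & set(d_ak[b]):
                    w[(a, b)] += min(d_ak[a][k], d_ak[b][k]) * math.log(D / d_k[k])
        return w

    def citation_weights(author_docs, doc_citations):
        """Sketch of formula (2). author_docs: author -> set of documents,
        doc_citations: doc -> set of documents it cites."""
        c = defaultdict(int)
        for a, docs_a in author_docs.items():
            for b, docs_b in author_docs.items():
                if a != b:
                    c[(a, b)] = sum(len(doc_citations.get(d, set()) & docs_b)
                                    for d in docs_a)
        return c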

     In this way, we have constructed several heterogeneous information networks.
However, the influence of different heterogeneous information on collaboration may
differ. In order to measure this difference, we adopt the paradigm of causal
reasoning: the ACE (Average Causal Effect) is used to measure the influence of the
treatment variable Y (same institution, co-word, citation) on the outcome variable
X (collaboration).

                  ACE(Y → X) = E(X | Y) − E(X | ~Y)                               (3)

The larger the ACE, the more obvious the causal effect of Y on X.
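
As a minimal illustration, the ACE of formula (3) can be estimated directly from
the 2x2 contingency counts reported later in Tables 4-7. The helper below,
including the "probability increase" (PI) ratio used in those tables, is an
assumed implementation for illustration only.

    def average_causal_effect(x_y, notx_y, x_noty, notx_noty):
        """ACE(Y -> X) = P(X|Y) - P(X|~Y), estimated from contingency counts:
        x_y       - pairs with Y and collaboration (X)
        notx_y    - pairs with Y and no collaboration
        x_noty    - pairs without Y but with collaboration
        notx_noty - pairs without Y and without collaboration
        """
        p_x_given_y = x_y / (x_y + notx_y)
        p_x_given_noty = x_noty / (x_noty + notx_noty)
        ace = p_x_given_y - p_x_given_noty
        # PI: how many times the probability of collaboration is increased
        pi = p_x_given_y / p_x_given_noty - 1 if p_x_given_noty > 0 else float("inf")
        return ace, pi

    # Whole-network institution counts from Table 4:
    # P(X|Y) ~ 8.42e-3, P(X|~Y) ~ 2.53e-5, ACE ~ 8.40e-3, PI ~ 331.6
    print(average_causal_effect(16205, 1908127, 24937, 984907682))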

2.3    Network Embedding Algorithms for Collaboration Prediction and
       Evaluation Criteria

      Network embedding is currently a popular method in the field of collaboration
prediction. Scholars keep incorporating new ideas and methods into network
representation learning, and the accuracy and efficiency of the algorithms keep
improving. This article compares seven well-known and commonly used network
representation learning algorithms and evaluates their accuracy and efficiency in
scientific research collaboration prediction.
      Given a network G(V, E), where V is the set of vertices and E is the set of
edges, with |V| = N and |E| = M, let the adjacency matrix of the graph be W.
      (1) DeepWalk. Influenced by the well-known word2vec, Perozzi et al. proposed
DeepWalk [17] at KDD 2014. The model uses random walks to generate vertex
sequences; each node sequence is analogous to a sentence and each node to a word,
and the Skip-gram method is used for training to obtain the vector representation
of each node.
      (2) LINE (Tang Jian et al., 2015) [18] defines first-order proximity and
second-order proximity between nodes in the graph, and formulates first-order and
second-order optimization objectives to preserve these relationships, so that
nodes that are close in the graph are also close in the embedding space.
      (3) Node2vec was published at KDD 2016 by Grover and Leskovec [19]. It improves
the random walk strategy of DeepWalk by considering both the homophily and the
structural similarity of nodes. A biased random walk that interpolates between
depth-first and breadth-first traversal is designed: assuming the walk has just
moved from node u to node v, the next node w among v's neighbors is sampled with
probabilities controlled by the parameters p and q.
      (4) SDNE (Wang D et al., 2016) [20], also published at KDD 2016, can be seen as
an extension of the LINE model. The method is essentially a graph autoencoder that
makes the reconstruction loss of the graph representation as small as possible
while keeping the vector representations of connected nodes as close as possible.
      (5) HOPE [21] learns two different representations for each node and focuses on
preserving the asymmetric transitivity information of the original network.
Different asymmetric proximity matrices are constructed, and the JDGSVD algorithm
is then used for dimensionality reduction to obtain the network representation of
each node.
      (6) NetMF. Jie Tang's team unified DeepWalk, LINE, node2vec and other
algorithms under a matrix factorization framework and proposed NetMF, a
representation learning method that directly factorizes the matrix derived from
the target network [22]. They later improved the algorithm by introducing sparse
matrix factorization and proposed NetSMF [23]. (A simplified sketch of NetMF is
given after this list.)
      (7) ProNE. Zhang Jie from Professor Tang Jie's team at Tsinghua University
proposed ProNE [24], a fast and scalable large-scale network representation
learning algorithm, at IJCAI 2019. The algorithm has two steps: 1) sparse matrix
factorization for fast embedding initialization, and 2) spectral propagation on
the modulated network for embedding enhancement. Compared with classical random
walk methods, this method improves efficiency by tens to hundreds of times.
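
The following is a minimal NumPy/SciPy sketch of the small-window NetMF idea
mentioned in (6): build the closed-form DeepWalk matrix and factorize it with a
truncated SVD. The function name and the defaults (window T, negative-sampling
constant b, embedding dimension) are illustrative assumptions; real experiments
should rely on the authors' released implementation.

    import numpy as np
    from scipy.sparse.linalg import svds

    def netmf_embedding(A, dim=64, window=10, b=1.0):
        """Simplified NetMF for a small dense adjacency matrix A (numpy array).
        dim must be smaller than the number of nodes."""
        n = A.shape[0]
        vol = A.sum()                            # volume of the graph
        d = A.sum(axis=1)
        D_inv = np.diag(1.0 / np.maximum(d, 1e-12))
        P = D_inv @ A                            # random-walk transition matrix
        S, P_r = np.zeros_like(P), np.eye(n)     # S = P^1 + ... + P^T
        for _ in range(window):
            P_r = P_r @ P
            S += P_r
        M = (vol / (b * window)) * S @ D_inv     # closed-form DeepWalk matrix
        M = np.log(np.maximum(M, 1.0))           # element-wise truncated log
        U, s, _ = svds(M, k=dim)                 # rank-dim factorization
        return U * np.sqrt(s)                    # embedding = U * sqrt(Sigma)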
      Network embedding approaches and graph neural networks are moving from theory
to application. In 2018, the Alimama team open-sourced Euler, a distributed graph
deep learning tool, DeepMind open-sourced the graph_nets library, and New York
University researchers open-sourced the graph neural network framework DGL; in
2019, Facebook open-sourced PyTorch-BigGraph, a PyTorch-based graph representation
learning framework [25]. Other recent graph representation learning tools include
the Tsinghua University data mining team's CogDL [26] and Tang Jian's graph
representation learning system GraphVite [27]. The emergence of these platform
tools has lowered the threshold for applying graph representation learning,
allowing it to be used in a much wider range of scenarios.

     In this paper, we use these network embedding models for collaboration
prediction. The collaboration prediction problem is transformed into a binary
classification problem of whether a link exists. The edge set is divided into a
training set and a test set at a given ratio; taking the training set as the
baseline, the edges in the test set are used as positive samples and the same
number of negative samples are randomly generated to evaluate the model. We use
the AUC, ROC_AUC and F1_Score indicators to evaluate the accuracy of the model.
The AUC is the most commonly used indicator for evaluating link prediction: given
the ranking of all non-observed links, the AUC value can be interpreted as the
probability that a randomly chosen missing link is given a higher score than a
randomly chosen nonexistent link [28]. The AUC value ranges from 0.5 to 1; random
assignment gives 0.5 and perfect prediction gives 1, so the closer the AUC is to
1, the better the model. ROC_AUC is defined as the area under the ROC curve in the
binary classification problem, while F1_Score balances precision and recall. In
addition, we use execution time to measure the efficiency of each model.
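
The evaluation protocol above can be sketched as follows. The embed() callback
stands in for any of the seven models (for example the netmf_embedding sketch
above), and the choice of Hadamard edge features with a logistic-regression
classifier is an assumption for illustration rather than the paper's exact setup.

    import numpy as np
    import networkx as nx
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, f1_score

    def sample_negatives(G, n, rng):
        """Draw n node pairs that are not edges of G."""
        nodes = list(G.nodes())
        neg = set()
        while len(neg) < n:
            u, v = rng.choice(nodes, size=2, replace=False)
            if not G.has_edge(u, v):
                neg.add((u, v))
        return list(neg)

    def evaluate_link_prediction(G, embed, test_ratio=0.2, seed=0):
        rng = np.random.default_rng(seed)
        edges = list(G.edges())
        rng.shuffle(edges)
        n_test = int(len(edges) * test_ratio)
        test_pos, train_pos = edges[:n_test], edges[n_test:]

        G_train = nx.Graph()
        G_train.add_nodes_from(G.nodes())
        G_train.add_edges_from(train_pos)
        emb = embed(G_train)                   # dict: node -> embedding vector

        def pairs_to_xy(pos, neg):
            X = np.array([emb[u] * emb[v] for u, v in pos + neg])  # Hadamard feature
            y = np.array([1] * len(pos) + [0] * len(neg))
            return X, y

        # fit the classifier on training edges vs. random training negatives ...
        X_tr, y_tr = pairs_to_xy(train_pos, sample_negatives(G, len(train_pos), rng))
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        # ... and evaluate on the held-out test edges vs. fresh negatives
        X_te, y_te = pairs_to_xy(test_pos, sample_negatives(G, n_test, rng))
        proba = clf.predict_proba(X_te)[:, 1]
        return roc_auc_score(y_te, proba), f1_score(y_te, clf.predict(X_te))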


2.4    Multiple Feature Fused Collaboration Prediction Model

      The method of fusing multiple features is shown in Figure 3. From the document
collection we extract the basic collaboration relationship, the same-institution
relationship, the co-word relationship of articles, and the citation relationship
between authors, and construct the collaboration network, the same-institution
network, the co-word network, and the citation network.




                                 Fig. 3. Algorithm Flow
      In this part, we discuss the influence of several different feature
combinations on the results of scientific research collaboration prediction. The
fused features can be divided into two types: single-feature fusion and
multiple-feature fusion. Based on the experimental results of the previous part,
the same institution, the same keywords and the citation relationship all have a
positive effect on collaboration. This section therefore examines how these
features and their combinations affect the performance of scientific research
collaboration prediction; the AUC, ROC_AUC and F1_Score indicators are again used
to evaluate the models.
                        Table. 2. Various feature fusion methods
Category   Abbr.               Description
Baseline   Raw                 Raw Network
Single     Raw+I (ri)          Raw Network + Same Institution Network
feature    Raw+K (rk)          Raw Network + Author Coword Network
           Raw+C (rc)          Raw Network + Author Citation Network
Multiple   Raw+I+K (rik)       Raw Network + Same Institution Network +
feature                        Author Coword Network
           Raw+K+C (rkc)       Raw Network + Author Coword Network +
                               Author Citation Network
           Raw+I+C (ric)       Raw Network + Same Institution Network +
                               Author Citation Network
           Raw+I+K+C (rikc)    Raw Network + Same Institution Network +
                               Author Coword Network + Author Citation Network
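
One straightforward way to realize the fusion schemes in Table 2, assumed here for
illustration, is to merge the edge lists of the selected feature networks into the
raw collaboration network before embedding. The helper below uses networkx
together with the hypothetical netmf_embedding sketch shown earlier.

    import networkx as nx

    def fuse_networks(raw, extra_networks):
        """Merge the raw collaboration network with any number of feature networks
        (same institution, co-word, citation) by taking the union of their edges."""
        fused = nx.Graph()
        fused.add_nodes_from(raw.nodes())
        fused.add_weighted_edges_from(raw.edges(data="weight", default=1.0))
        for net in extra_networks:
            for u, v, w in net.edges(data="weight", default=1.0):
                if fused.has_edge(u, v):
                    fused[u][v]["weight"] += w   # accumulate evidence from features
                else:
                    fused.add_edge(u, v, weight=w)
        return fused

    # e.g. the "ric" scheme: raw + same institution + citation
    # G_ric = fuse_networks(G_raw, [G_institution, G_citation])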



3      Experiments

3.1    Field selection and exploratory analysis

Research field selection
     Stem cell and regenerative medicine research has brought revolutionary changes
to the treatment of cancer and other diseases; it has been selected among the top
ten scientific and technological advances by the US magazine Science nine times,
and related research projects have been laid out many times. We therefore selected
the field of stem cells for the empirical study. We searched ISI Web of Knowledge
with the query (TI=Stem Cells) in May 2019, which yielded 433 469 articles. The
number of articles per year in the search results is shown in Figure 4: the number
of articles shows a trend from slow growth to rapid growth and then to saturated
growth; the decline in 2019 may be due to incomplete data collection.




                            Fig. 4.    Number of articles per year

Heterogeneous information extraction
      In the records downloaded from the ISI database, the DE field contains the
author keywords, the C1 field contains the author institutions, and the citation
relationships between articles are obtained from the CR field. Author and
institution information is split from the author-institution field; for the
institution, only the top-level unit such as a university or hospital is retained.
      After splitting, 1,682,654 author occurrences were obtained, which were merged
into 1,461,721 distinct authors, of whom more than 40,000 have published articles
as first author. Considering computational performance, we selected the 5,403 of
these authors who have collaborated with other authors more than twice. The total
number of collaboration edges between these authors is 4,818. Table 3 shows the
basic statistical characteristics of this baseline collaboration network by year.

     Table. 3. Annual statistical characteristics of the collaboration network
                 Year         #Nodes       #Edges      Density
                 2007         65           43          2.0673E-02
                 2008         402          296         3.6724E-03
                 2009         529          367         2.6279E-03
                 2010         705          486         1.9584E-03
                 2011        844         593         1.6669E-03
                 2012        1044        723         1.3280E-03
                 2013        1255        878         1.1158E-03
                 2014        1355        960         1.0465E-03
                 2015        1515        1136        9.9053E-04
                 2016        1384        969         1.0125E-03
                 2017        1259        847         1.0696E-03
                 2018        1052        745         1.3476E-03
                 2019        392         255         3.3274E-03

The predictability of collaboration networks
      Predictability is an important research problem in link prediction. The
predictability of the network represents the upper limit of prediction. Random
networks are completely unpredictable. Any link prediction algorithm will not get
better results on completely random networks. The predictability of the network is
related to the characteristics and evolution of the network itself. Two articles by
Newman et al [29,30] studied the structural path and other properties of the scientific
research collaboration network. Xu Xiaoke [31], Tan Suoyi [32] and others have
conducted special research on the predictability of the network. Studies have shown
that in networks with good predictability, the largest eigenvalue of the adjacency
matrix of the network is much larger than the second largest eigenvalue.




                    Fig.5. Eigenvalue distribution of adjacency matrix
   Figure 5 shows the eigenvalue distribution of the adjacency matrix of the
studied network. The largest eigenvalue is about 23.37, and there is an obvious
gap between it and the rest of the spectrum, so the network should have good
predictability.
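
A quick way to check this spectral-gap condition, sketched below under the
assumption that the collaboration network is available as a networkx graph, is to
compare the two largest eigenvalues of the adjacency matrix.

    import networkx as nx
    from scipy.sparse.linalg import eigsh

    def spectral_gap(G):
        """Return the two largest adjacency eigenvalues and their ratio."""
        A = nx.adjacency_matrix(G).astype(float)
        vals = eigsh(A, k=2, which="LA", return_eigenvectors=False)
        lam2, lam1 = sorted(vals)
        return lam1, lam2, lam1 / lam2

    # A clearly dominant first eigenvalue (about 23.37 here) suggests good predictability.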


   3.2     The impact of features on collaboration from a causal perspective

         In this part, we use the method introduced in section 2.2 to compare the effects
   of the same institution, co-word, and citation on the probability of collaboration by
   year.
   Institutional Information
         First, we studied whether the same institution has a causal effect on
   scientific research collaboration in the large network formed by the first
   authors of all collaborative articles. We counted the number of author pairs in
   each of three cases: same institution with collaboration, same institution
   without collaboration, and different institutions with collaboration;
   subtracting these counts from the number of all possible pairs gives the
   frequency of different institutions without collaboration. The results are shown
   in Table 4, where event X denotes that there is collaboration between two
   authors and Y denotes that the authors share an institution.
           Table. 4. Same Institution and its impact of collaboration in the big network
                                   X                   ~X
                 Y                 16205               1908127
                 ~Y                24937               984907682

                         P(X|Y) = 16205/(16205+1908127) = 8.421e-3
                       P(X|~Y) = 24937/(24937+984907682) = 2.5318e-5
        The average causal effect of event Y on event X:
                          ACE(Y→X) = P(X|Y) − P(X|~Y) = 8.396e-3
        The probability of collaboration is thus increased (PI) by 331.62 times.
        Next, we split the collaboration network of 5,403 nodes and 4,818 edges by
   year and examine the causal effect of the same institution on scientific
   research collaboration in each year. The results are shown in Table 5.
              Table.5. The impact of the same institution on collaboration by year
Year     X&Y     X &~Y     ~X& Y        ~X&~Y          P(X|~Y)           P(X|Y)            PI
2007      12       21        4           2043          1.0174E-02       7.5000E-01       72.71
2008      79       139       240         80143         1.7314E-03       2.4765E-01       142.03
2009      116      159       473         138908        1.1433E-03       1.9694E-01       171.25
2010      171      192       1030        246767        7.7746E-04       1.4238E-01       182.14
2011      197      265       1318        353966        7.4810E-04       1.3003E-01       172.82
2012      246      310       1667        542223        5.7139E-04       1.2859E-01       224.05
2013      287      397       2177        784024        5.0611E-04       1.1648E-01       229.14
2014      321      421       2559        914034        4.6038E-04       1.1146E-01       241.10
2015      341      557       2515        1143442       4.8689E-04       1.1940E-01       244.23
2016      332      409       2091        954204        4.2845E-04       1.3702E-01       318.81
2017      288      355       1622        789646        4.4937E-04       1.5079E-01       334.55
2018      247      310       1251        551018        5.6228E-04       1.6489E-01       292.25
2019      80       116       116         76324         1.5175E-03       4.0816E-01       267.97
        It can be seen from Table 5 that, for each time slice of the collaboration
   network, the same institution has a very obvious causal effect on scientific
   research collaboration: being in the same institution increases the probability
   of collaboration by roughly 200-300 times.

   Keyword information
         This part discusses whether authors' sharing of common keywords affects
   scientific research collaboration. Let event X denote that there is
   collaboration between two authors and Y denote that the authors have used common
   keywords. Using the same method as for institutions above, the causal effect of
   shared keywords on scientific research collaboration is calculated for each
   year. The results are shown in Table 6.
                Table. 6. The impact of the co-word on collaboration by year
Year   X&Y       X &~Y       ~X& Y      ~X&~Y           P(X|~Y)           P(X|Y)              PI
2007      0          33        0           2047         1.5865E-02            0               -1
2008      1         217        8           80375        2.6926E-03       1.1111E-01          40.27
2009     34         241       1052        138329        1.7392E-03       3.1308E-02          17.00
2010     95         268       3727        244070        1.0968E-03       2.4856E-02          21.66
2011    139         323       8327        346957        9.3009E-04       1.6419E-02          16.65
2012    189         367      20044        523846        7.0010E-04       9.3412E-03          12.34
2013    228         456      26193        760008        5.9963E-04       8.6295E-03          13.39
2014    290         452      42131        874462        5.1662E-04       6.8362E-03          12.23
2015    352         546      58446       1087511        5.0181E-04       5.9866E-03          10.93
2016    341         400      53734        902561        4.4299E-04       6.3061E-03          13.24
2017    313         330      58653        732615        4.5024E-04       5.3081E-03          10.79
2018    349         208      58258        494011        4.2087E-04       5.9549E-03          13.15
2019    138          58       9827         66613        8.6994E-04       1.3848E-02          14.92
        As can be seen from Table 6, for each time slice of the collaboration
   network, the use of the same keywords has a causal effect on collaboration;
   co-words increase the probability of collaboration by roughly 10-20 times, much
   less pronounced than the effect of the same institution. This may be because
   authors in the relatively narrow field we study tend to use some common
   keywords, and these overly frequent keywords weaken the impact of keyword
   co-occurrence on collaboration.

   Reference information
         This part discusses the effect of citations on research collaboration. Let
   event X denote that there is collaboration between two authors and Y denote
   that there is a citation relationship between them, that is, author a cites
   documents of author b or author b cites documents of author a. Using the same
   method as above, the causal effect of citations on scientific research
   collaboration is calculated for each year. The results are shown in Table 7.
                   Table. 7. The impact of the citations on collaboration by year
   Year    X&Y      X&~Y       ~X&Y      ~X&~Y       P(X|~Y)        P(X|Y)        PI
   2007    0            33          0    2047        1.5865E-02             0           -1
   2008    0           218          0    80383       2.7047E-03             0           -1
   2009    3           272         32    139349      1.9481E-03      8.5714E-02    43.00
   2010    44          319        228    247569      1.2869E-03      1.6176E-01    124.70
   2011    115         347        825    354459      9.7800E-04      1.2234E-01    124.09
2012     145           411        1537    542353     7.5724E-04       8.6207E-02     112.84
2013     213           471        2518    783683     6.0065E-04       7.7993E-02     128.85
2014     259           483        4132    912461     5.2906E-04       5.8984E-02     110.49
2015     312           586        5165    1140792    5.1341E-04       5.6965E-02     109.95
2016     325           416        4527    951768     4.3689E-04       6.6983E-02     152.32
2017     294           349        5209    786059     4.4379E-04       5.3425E-02     119.38
2018     320           237        5521    546748     4.3328E-04       5.4785E-02     125.44
2019     126            70         728    75712      9.2370E-04       1.4754E-01     158.73
     It can be seen from Table 7 that, for each time slice of the collaboration
network, the causal effect of citations on collaboration is strong: citations
increase the probability of collaboration by about 100 times, less pronounced than
the effect of the same institution but clearly stronger than that of keywords.

3.3     Research on performance and efficiency of network embedding based
        collaboration prediction algorithms

     We divide the data set into training and test sets at ratios of 80%-20%,
60%-40%, and 40%-60%, and compare the performance of the seven algorithms
introduced in Section 2.3. Each algorithm uses the same embedding dimension, and
the four numbers in each cell of Table 8 are the (ROC_AUC, AUC, F1_Score, run
time) defined in Section 2.3.
           Table. 8. Performance and efficiency of embedding algorithms
Algorithm      dataset1(80% training      dataset2(60% training      dataset3(40% training
               set,20% test set)          set,40% test set)          set,60% test set)
ProNE          (0.8548, 0.7638, 0.7248,   (0.8181, 0.6891, 0.6459,   (0.8079, 0.6600, 0.6278,
               16.9s)                     16.13s)                    16.68s)
NetMF          (0.8686, 0.7554, 0.7166,   (0.8370, 0.7079 ,          (0.8152, 0.6550, 0.6270,
               19.3s)                     0.6683, 21.89s)            23.27s)
Hope           (0.4760, 0.3014, 0.2643,   (0.5559, 0.3355, 0.2998,   (0.5910, 0.3760, 0.3562,
               33.7s)                     26.26s)                    24.31s)
LINE           (0.8833, 0.7906, 0.7439,   (0.8376, 0.7179, 0.6715,   (0.8150, 0.6679, 0.6358,
               2203s)                     1788s)                     1297s)
Node2vec       (0.8049 ,0.4355, 0.3379,   ( 0.7592 , 0.3895,         (0.7576, 0.3563, 0.3099,
               404.6s)                    0.3238, 326.5s)            244.0s)
Deepwalk       (0.8062, 0.4431, 0.3379,   (0.7628, 0.3960, 0.3174,   (0.7528, 0.3548, 0.3083,
               340.4s)                    257.0s)                    219.0s)
SDNE           (0.5096, 0.3357, 0.2888,   (0.5184, 0.3517, 0.3110,   (0.5541, 0.3789, 0.3514,
               1661s)                     974.0s)                    609.8s)
     As can be seen from Table 8, the larger the proportion of the training set, the
better the performance of the various algorithms. Comparing the algorithms, the
LINE model has the best accuracy, with the highest ROC_AUC, AUC and F1_Score
values, but its running time is far too long, hundreds of times that of fast
models such as ProNE and NetMF. The time efficiency of classical random walk
algorithms such as DeepWalk and Node2vec is moderate, but their accuracy is
somewhat worse in this example, which may be related to parameter selection. The
accuracy of matrix factorization models such as ProNE and NetMF is not much
different from that of the LINE model, while their time efficiency is hundreds of
times better; the NetMF model in particular loses very little accuracy while being
highly efficient. Therefore, the following experiments choose NetMF as the network
representation learning model.

3.4    Results of the multiple-feature-fused collaboration prediction methods
       based on NetMF

     In this part, we select the collaboration data of a given year as the positive
examples of the test set and randomly generate the same number of negative
examples; the collaboration, institution, author co-word and citation data from
before that year (excluding that year) are fused as the training set. The feature
fusion schemes discussed in Section 2.4 are used in the experiments. Table 9 shows
the results of single-feature fusion and Table 10 the results of multiple-feature
fusion; the three numbers in each cell are (ROC_AUC, F1_Score, AUC), and the best
result in each row is shown in bold.
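
A minimal sketch of this year-by-year training/test construction follows, assuming
per-year edge lists for each relation type are available in dictionaries keyed by
year (the variable names are illustrative); the example corresponds to the "ric"
scheme.

    def yearly_split(coll_by_year, inst_by_year, cite_by_year, test_year):
        """Training data: fused relations strictly before test_year;
        test positives: collaboration edges of test_year."""
        train_edges = []
        for year in sorted(coll_by_year):
            if year >= test_year:
                break
            train_edges += coll_by_year[year]
            train_edges += inst_by_year.get(year, [])
            train_edges += cite_by_year.get(year, [])
        test_pos = coll_by_year[test_year]
        return train_edges, test_pos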

           Table. 9.    Single feature fusion collaboration prediction results
Year    Raw Network            Raw+Institution      Raw+Keyword          Raw+Citation
2009    (0.7053, 0.6757,       (0.8551, 0.8093,     (0.7047, 0.6785,     (0.7287, 0.7084,
        0.8176)                0.9143)              0.8157)              0.8311)
2010    (0.7579, 0.6975,       (0.8603, 0.8086,     (0.7278, 0.6872,     (0.7794, 0.7469,
        0.8279)                0.9018)              0.8085)              0.8461)
2011    (0.7644, 0.688,        (0.8767, 0.7926,     (0.7497, 0.6897,     (0.785, 0.7099,
        0.8091)                0.8922)              0.8018)              0.8216)
2012    (0.7389, 0.657,        (0.854, 0.7746,      (0.718, 0.6515,      (0.7808, 0.7206,
        0.7694)                0.869)               0.7564)              0.8089)
2013    (0.7471, 0.6674,       (0.8683, 0.7813,     (0.7476, 0.6834,     (0.8232, 0.7528,
        0.7708)                0.8744)              0.7754)              0.8341)
2014    (0.7231, 0.6385,       (0.8684, 0.7719,     (0.7255, 0.6531,     (0.8273, 0.749,
        0.7214)                0.8616)              0.7408)              0.8307)
2015    (0.7301, 0.6347,       (0.8630, 0.7861,     (0.7554, 0.6655,     (0.8371, 0.7456,
        0.7151)                0.8648)              0.7549)              0.8296)
2016    (0.749, 0.6615,        (0.8977, 0.8225,     (0.8024, 0.7059,     (0.8688, 0.7833,
        0.7011)                0.8766)              0.7742)              0.8463)
2017    (0.756, 0.6635,        (0.9041, 0.8276,     (0.7938, 0.6989,     (0.8801, 0.7804,
        0.6905)                0.8729)              0.7574)              0.8429)
2018    (0.8072, 0.698,        (0.9132, 0.8268,     (0.8489, 0.7463,     (0.9225, 0.8282,
        0.7212)                0.8743)              0.7988)              0.8807)
2019    (0.8471, 0.6196,       (0.9501, 0.8275,     (0.9073, 0.7373,     (0.9678, 0.8118,
        0.6761)                0.8343)              0.7815)              0.882)

           Table. 10.      Multi-feature fusion collaboration prediction results
Year     Raw+I+K            Raw+K+C             Raw+I+C            Raw+I+K+C
2009     (0.8374, 0.7820,   (0.7301, 0.7003,    (0.8550, 0.8038,   (0.8547, 0.8147,
         0.9064)            0.8286)             0.9154)            0.9150)
2010     (0.8627, 0.7963,   (0.7547, 0.7181,    (0.8652, 0.8107,   (0.8826, 0.8292,
         0.9012)            0.8277)             0.9041)            0.9140)
2011     (0.8542, 0.7723,   (0.7936, 0.7184,    (0.8833, 0.8078,   (0.8842, 0.8111,
         0.8782)            0.8312)             0.9009)            0.8992)
2012     (0.8496, 0.7621,   (0.7697, 0.6888,    (0.8792, 0.8147,   (0.8804, 0.7911,
         0.8637)            0.7920)             0.8924)            0.8883)
2013     (0.8726, 0.7745,   (0.8198, 0.7472,    (0.903, 0.8269,    (0.8908, 0.8132,
         0.8696)            0.8293)             0.9035)            0.8904)
2014     (0.8375, 0.7375,   (0.804, 0.7000,     (0.9028, 0.826,    (0.8962, 0.8021,
         0.8316)            0.7995)             0.8953)            0.8829)
2015     (0.8541, 0.7368,   (0.8194, 0.7201,    (0.9166, 0.8257,   (0.8965, 0.7923,
         0.8359)            0.8063)             0.9011)            0.8788)
2016     (0.8887, 0.7781,   (0.8495, 0.7472,    (0.9406, 0.8648,   (0.9187, 0.8328,
         0.8536)            0.8161)             0.9194)            0.8894)
2017     (0.8854, 0.7627,   (0.8685, 0.7438,    (0.9355, 0.8524,   (0.9301, 0.8076,
         0.8361)            0.8139)             0.9041)            0.8801)
2018     (0.9003, 0.7678,   (0.9032, 0.7893,    (0.9583, 0.8644,   (0.9388, 0.8148,
         0.8357)            0.8455)             0.9193)            0.8836)
2019     (0.9344, 0.7333,   (0.9329, 0.7373,    (0.9657, 0.8196,   (0.9438, 0.7529,
         0.7811)            0.7862)             0.8739)            0.8034)

      Combining Table 9 and Table 10, it can be clearly seen that: (1) almost all
feature-fusion prediction methods exceed the original network in ROC_AUC, F1_Score
and AUC, indicating that feature fusion can improve the accuracy of collaboration
prediction. (2) The ordering is same institution (I) > citation (C) > co-word (K),
which is exactly the same as the order of the causal effects of these
relationships discussed in Section 3.2. (3) The accuracy of the multiple-feature
fusion methods is generally I+C > I+K+C > I+K > K+C. In the early years, when the
network has few nodes and relationships, the three-feature fusion I+K+C is better
than the two-feature fusion I+C, because multiple features bring in more
relationships and improve prediction. In later years there are relatively many
nodes and the prediction results are close to the upper limit of the network's
predictability; adding the keyword co-occurrence feature may introduce more noise
into the network because of frequently occurring words, so the prediction effect
is not further improved.

3.5    Collaboration prediction results

     Finally, we randomly select a researcher and predict collaborators for him/her.
In this example, we selected the researcher Lin Mingyan of Albert Einstein Coll
Med and used the NetMF model with Raw+I+C feature fusion, which performed well in
the experiments of Section 3.4. Table 11 lists the top 20 authors with the highest
predicted probability of future collaboration.

               Table. 11. Authors with the top 20 collaboration probability
Author           Institution            Sim      Author          Institution       Sim
Zheng,           Albert Einstein Coll   0.9977   Delahaye,       Albert Einstein   0.9783
Deyou            Med                             Fabien          Coll Med
Pedrosa,         Albert Einstein Coll   0.9967   Rockowitz,      Albert Einstein   0.9768
Erika            Med                             Shira           Coll Med
Chen, Jian       Albert Einstein Coll   0.9965   Wijetunga, N.   Albert Einstein   0.9643
                 Med                             Ari             Coll Med
Zhao, Dejian     Albert Einstein Coll   0.9937   Pal, Rajarshi   Manipal Univ      0.9590
                 Med                                             Branch Campus
Wang, Ping       Albert Einstein Coll   0.9928   Carromeu,       Univ Calif San    0.9240
                 Med                             Cassiano        Diego
Xue, E.          Albert Einstein Coll   0.9924   Marchetto,      Salk Inst Biol    0.9225
                 Med                             Maria C. N.     Studies
Sharma, V.       Albert Einstein Coll   0.9924   Zhou, Li        Albert Einstein   0.9148
P.               Med                                             Coll Med
Abrajano,        Albert Einstein Coll   0.9904   Jaffe, Andrew   Lieber Inst       0.8058
Joseph J.        Med                             E.              Brain Dev
Guo, Xingyi      Albert Einstein Coll   0.9891   Lei, Mingxing   Univ So Calif     0.7802
                 Med
Qureshi,         Albert Einstein Coll   0.9851   Will, Britta    Albert Einstein   0.7758
Irfan A.         Med                                             Coll Med
     The top-ranked authors in the results are mostly authors who have already
worked closely with the selected researcher, and the chance of their continued
collaboration in the future is high. In the recommendation results, the proportion
of authors from the same institution is very large, accounting for 75%, which is
also consistent with the patterns of scientific research collaboration.
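
For completeness, a recommendation like Table 11 can be produced by ranking all
other authors by the similarity of their embeddings to the target author. The
sketch below assumes the fused network has already been embedded (for example with
the netmf_embedding sketch above) into a dictionary emb and uses cosine similarity
as the score.

    import numpy as np

    def recommend_collaborators(emb, target, top_k=20):
        """Rank candidate authors by cosine similarity to the target author."""
        t = emb[target] / np.linalg.norm(emb[target])
        scores = {}
        for author, vec in emb.items():
            if author != target:
                scores[author] = float(np.dot(t, vec / np.linalg.norm(vec)))
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # e.g. recommend_collaborators(emb, "Lin, Mingyan", top_k=20)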


4         Conclusion and Discussion

     This paper constructs a collaboration prediction method based on heterogeneous
information fused network embedding and conducts an empirical analysis in the
field of stem cells.
     (1) An author-centered heterogeneous information fusion schema is constructed on
the basis of information fusion theory. The predictability of the scientific
research collaboration network and the effects of institution, co-word and
citation information on collaboration are discussed. Experiments show that all
three have an impact on collaboration, and the average causal effect analysis
gives the ordering same institution > citation > keyword.
     (2) The accuracy and efficiency of network representation learning methods for
collaboration prediction are compared. Experiments show that, considering both
accuracy and computational efficiency, the graph representation learning methods
based on matrix factorization (such as NetMF) achieve good results.
     (3) A scientific research collaboration prediction method based on heterogeneous
information fusion is constructed and empirically analyzed on the yearly networks.
Experiments show that fusing multiple features can greatly improve the accuracy of
collaboration prediction; among the feature combinations, collaboration + same
institution + citation achieves outstanding results.

      Future research will build more elaborate causal diagrams under the
author-centered information fusion framework and explore the causal effects of
feature combinations; it will explore more detailed methods for selecting
relationships within features, continuously mine the relationships contained in
various kinds of information, and improve the identification effect. We will also
expand the information fusion framework by introducing other latent features, such
as research fields, research topics, and writing styles, into collaboration
prediction, continuously enriching the method.


References

1.   Newman M.E.J. Coauthorship Networks and Patterns of Scientific Collaboration.
     Proceedings of the National Academy of Sciences of the United States of America,
     (101): 5200-5205 (2004)
2.   Liben‐Nowell D, Kleinberg J. The link‐prediction problem for social networks. Journal of
     the American society for information science and technology, 58(7): 1019-1031(2007).
3.   Lü Linyuan. Link Prediction in Complex Networks. Journal of University of Electronic
     Science and Technology of China, 39(05):651-661(2010).
      (吕琳媛.复杂网络链路预测.电子科技大学学报,39(05):651-661.(2010))
4.   Guns R, Rousseau R. Recommending research collaborations using link prediction and
     random forest classifiers. Scientometrics, 101(2): 1461-1473(2014).
5.   Yan E, Guns R. Predicting and recommending collaborations: An author-, institution-,
     and country-level analysis. Journal of Informetrics, 8(2): 295-309(2014).
6.   Shan Songyan, Wu Zhenxin. Review on the author similarity algorithm in the field of
     author name disambiguation and research collaboration prediction .Journal of Northeast
     Normal University(Natural Science Edition), ,51(02):71-80(2019).
      (单嵩岩,吴振新.面向作者消歧和合作预测领域的作者相似度算法述评.东北师大学
       报(自然科学版),51(02):71-80(2019).)
7.   Zhang Jinzhu, Yu Wenqian, Liu Jingjie, Wang Yue. Predicting Research Collaborations
     Based on Network Embedding.. Journal of the China Society for Scientific and Technical
     Information,37(02): 132- 139 (2018).
       (张金柱,于文倩,刘菁婕,王玥.基于网络表示学习的科研合作预测研究[J].情报学报,
       37(02): 132 -139,(2018) ).


8.     Yu Chuanming, Lin Aochen, Zhong Yunci, An Lu. Scientific Collaboration
       Recommendation Based on Network Embedding. Journal of the China Society for
       Scientific and Technical Information.38(05): 500-511(2019).
      (余传明,林奥琛,钟韵辞,安璐.基于网络表示学习的科研合作推荐研究[J].情报学报
      38(05): 500 - 511(2019)).
9.     Wang Zhibing, Han wenmin, Sun Zhumei, Pan xuelian.Research on Scientific
       Collaboration Prediction Based on the Combination of Network Topology and Node
       Attributes. Information Studies: Theory & Application, (08):116-120+109(2019).
      (汪志兵,韩文民,孙竹梅,潘雪莲.基于网络拓扑结构与节点属性特征融合的科研合作预
      测研究.情报理论与实践,(08):116-120+109(2019)).
10.    Liu P, Zheng K , Zou D. Research on Recommendation S&T colleboration based on LDA
       model. Information Studies: Theory &Application.38(9): 79-85(2015).
      (刘萍, 郑凯伦, 邹德安. 基于LDA模型的科研合作推荐研究.情报理论与实践, 38(9):
      79-85(2015)).
11.    Yu C, Gong Y,Zhao S,et al. Collaboration Recommendation of Finance Research Based
       on Multi-feature Fusion. Data Analysis and Knowledge Discovery,(8): 39-47(2017).
      (余传明, 龚雨田, 赵晓莉, 等. 基于多特征融合的金融领域科研合作推荐研究. 数据
      分析与知识发现, (8): 39-47(2017)).
12.    Lin Y,Wang K,Liu H,et al. Application of Network Representation Learning in the
       Prediction of Scholar Academic collaboration.Journal of the China Society for Scientific
       and Technical Information, 39(04):367-373(2020).
      (林原,王凯巧,刘海峰,许侃,丁堃,孙晓玲.网络表示学习在学者科研合作预测中的应用
      研究.情报学报, 39(04):367-373(2020)).
13.    Zhang X,Wen Y,Xu H. A Fusion Model of Network Representation Learning and Topic
       Model for Author collaboration Prediction. Data Analysis and Knowledge Discovery
       (2021).
      (张鑫,文奕,许海云.一种融合表示学习与主题表征的作者合作预测模型.数据分析与知
      识发现: 1-19(2021)).
14.    Xu H, Dong K, Wei L et al. Research on Multi-source Data Fusion Method in
       Scientometrics. Journal of the China Society for Scientific and Technical
       Information,37(03):318- 328(2018).
      (许海云,董坤,隗玲,王超,岳增慧.科学计量中多源数据融合方法研究述评.情报学报,
      37(03):318- 328(2018)).
15.    Hua B,Li G. Discussion on Theory and Application of Multi-Source Information Fusion
       in Big Data Environment. Library and Information Service ,59(16):5-10 (2015).
      (化柏林,李广建.大数据环境下多源信息融合的理论与应用探讨.图书情报工作,
      2015,59(16):5-10.)
16.    Morris S A, Yen G G. Construction of bipartite and unipartite weighted networks from
       collections of journal papers. Physics, (2005).
17.    Perozzi B, Al-Rfou R, Skiena S. Deepwalk: Online learning of social
       representations//Proceedings of the 20th ACM SIGKDD international conference on
       Knowledge discovery and data mining. ACM:701-710(2014).
18.    Tang J, Qu M, Wang M, et al. Line: Large-scale information network
       embedding//Proceedings of the 24th international conference on world wide web.
       International World Wide Web Conferences Steering Committee,1067-1077(2015).


19.    Grover A, Leskovec J. node2vec: Scalable feature learning for networks//Proceedings of
       the 22nd ACM SIGKDD international conference on Knowledge discovery and data
       mining. ACM, 855-864(2016).
20.    Wang D, Peng C, Zhu W. Structural Deep Network Embedding// Acm Sigkdd
       International Conference on Knowledge Discovery & Data Mining. (2016).
21.    Ou M, Cui P, Pei J, et al. Asymmetric transitivity preserving graph
       embedding//Proceedings of the 22nd ACM SIGKDD international conference on
       Knowledge discovery and data mining. ACM: 1105-1114(2016).
22.    Qiu, Jiezhong , et al. "Network Embedding as Matrix Factorization: Unifying DeepWalk,
       LINE, PTE, and node2vec." the Eleventh ACM International Conference ACM, (2018).
23.    Qiu J, Dong Y, Ma H, et al. Netsmf: Large-scale network embedding as sparse matrix
       factorization//The World Wide Web Conference. ACM: 1509-1520(2019).
24.    Jie Zhang, Yuxiao Dong, Yan Wang,et al. ProNE: Fast and Scalable Network
       Representation Learning.//In Proceedings of the 28th International Joint Conference on
       Artificial Intelligence (IJCAI'19),(2019)
25.    Lerer A, Wu L, Shen J, et al. PyTorch-BigGraph: A Large-scale Graph Embedding
       System// Proceedings of the 2nd SysML Conference,(2019).
26.    Fey M, Lenssen J E. Fast graph representation learning with PyTorch Geometric. arXiv
       preprint arXiv:1903.02428, (2019).
27.    Zhu Z, Xu S, Tang J, et al. GraphVite: A High-Performance CPU-GPU Hybrid System
       for Node Embedding//The World Wide Web Conference. ACM, 2019: 2494-2504.
28.    Lü L. “Link Prediction in Complex Networks”. Journal of University of Electronic
       Science and Technology of China,39(05) , pp .651-661(2010).
29.    Newman, M. E J. Scientific collaboration networks. I. Network construction and
       fundamental results. Physical Review E , 64(1):016131(2001).
30.    Newman, M. E J . Scientific collaboration networks. II. Shortest paths, weighted
       networks, and centrality. Physical Review E Statal Nonlinear & Soft Matter Physics,
       64(1):016132 (2001).
31.    Xu X, Xu S,Zhu Y,et al,Link Predictability in Complex Networks. Complex Systems and
       Complexity Science,11(01): 41-47(2014).
      (许小可,许爽,朱郁筱,张千明.复杂网络中链路的可预测性.复杂系统与复杂性科学,
      11(01):41-47(2014)).
32.    Tan S,Qi M,Wu J et al. Link predictability of complex network from spectrum
       perspective. Acta Physica Sinica,69(08):188-197(2020).
      (谭索怡,祁明泽,吴俊,吕欣.复杂网络链路可预测性:基于特征谱视角.物理学
        报,69(08):188-197.(2020)).