=Paper=
{{Paper
|id=Vol-3745/paper19
|storemode=property
|title=Biomedical Association Inference on Pandemic Knowledge Graphs: A Comparative Study
|pdfUrl=https://ceur-ws.org/Vol-3745/paper19.pdf
|volume=Vol-3745
|authors=Mengjia Wu,Chao Yu,Jian Xu,Ying Ding,Yi Zhang
|dblpUrl=https://dblp.org/rec/conf/eeke/WuYX0Z24
}}
==Biomedical Association Inference on Pandemic Knowledge Graphs: A Comparative Study==
<pdf width="1500px">https://ceur-ws.org/Vol-3745/paper19.pdf</pdf>
<pre>
                         Biomedical association inference on pandemic knowledge
                         graphs: A comparative study⋆
                         Mengjia Wu1,*, Chao Yu2, Jian Xu2, Ying Ding3 and Yi Zhang1
                         1 Australian Artificial Intelligence Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, 15 Broadway, Ultimo,
                           NSW, Australia
                         2 School of Information Management, Sun Yat-sen University, Guangzhou, China
                         3 School of Information, University of Texas, Austin, TX, USA


                                          Abstract
                                          Acquiring insights and understanding from historical pandemics is crucial for reducing the likelihood of their recurrence. The utilization
                                          of knowledge graphs stands as an essential tool for researchers, with knowledge inference emerging as a prominent task within these
                                          graphs to deduce previously unidentified connections between entities. This study endeavors to construct a knowledge graph centered on
                                          pandemic research and to evaluate the efficacy of various mainstream methodologies in the context of biomedical association inference.
                                          Our findings indicate that graph representation techniques hold significant promise for these tasks, and that heterogeneous graph
                                          representation techniques achieve high predictive accuracy. Nonetheless, advancing this area of research requires
                                          more refined experimental designs and more adaptive learning strategies.

                                           Keywords
                                           Biomedical knowledge graph, graph representation, knowledge inference


                         1. Introduction                                                                                                     could predict relationships within this graph. By masking
                                                                                                                                             10% of the connections of each type, we applied five different
                         Biomedical entity association inference is a long-standing task                                                     methods to the masked graph to distinguish the hidden connec-
                         for scientific researchers and industry practitioners to under-                                                     tions from an equal number of randomly sampled non-existent
                         stand the relationships between biomedical entities and pro-                                                        connections. The findings revealed the varying effectiveness
                         pose first-hand literature-based evidence for further investi-                                                      of these methods in identifying the obscured connections,
                         gations [1, 2]. Severe Acute Respiratory Syndrome (SARS),                                                           with HetGNN proving the most effective. Nonetheless, the
                         Middle East Respiratory Syndrome (MERS) and Coronavirus                                                             flexibility and applicability of different graph representation
                         Disease 2019 (COVID-19), the three notorious pandemics in                                                           methods across varied contexts need enhancement. This
                         public health history, presented huge threats to human lives                                                        research illustrates the application of multiple prominent
                         and social stability [3, 4]. Inferring knowledge from the                                                           methods in deducing associations in knowledge graphs and
                         pandemic knowledge foundation, which encompasses the                                                                evaluates the accuracy of these methods.
                         large body of coronavirus-related research articles,                                                                   The remainder of this paper is organized as follows: We
                         may bring insights into the evolutionary mechanisms of                                                              introduce the pandemic knowledge graphs and the examined
                         coronaviruses, helping to reduce public uncertainty and                                                             methods in the section Data and Method, followed by
                         to develop precautions for future infectious disease                                                                Experiment Settings and Results. We conclude the study
                         crises [5, 6]. However, the complexity, heterogeneity and                                                           and outline some future directions in the section
                         intricate associations of biomedical entities                                                                       Discussion and Conclusions.
                         present a challenge in exploring newly emerging knowl-
                         edge.
                            Knowledge graphs, which are extensively used to depict                                                           2. Data and Method
                         intricate data relationships, serve as the foundation for ana-
                         lyzing and inferring associations [2, 6, 7, 8]. These graphs                                                        The integrative Biomedical Knowledge Hub (iBKH) is a
                         represent biomedical entities such as genes, diseases, chemi-                                                       knowledge graph dataset that curates the associations of
                         cals, and drugs as nodes, with their relationships illustrated                                                      11 categories of biomedical entities from 17 publicly avail-
                         as either directed or undirected edges, sometimes accompa-                                                          able data sources [9]. Using the iBKH as the global dataset,
                         nied by supplementary descriptive attributes. Leveraging                                                            we searched scholarly articles across PubMed using search
                         network analysis techniques, various methods have been                                                              strategies from [3] and cross-matched the search results to
                         introduced to investigate patterns of association and predict                                                       iBKH. By extracting the nodes and edges relevant to papers
                         previously unknown relationships.                                                                                   in the search results, we constructed a pandemic-specific
                            In this study, we developed a knowledge graph from schol-                                                        sub-graph of the iBKH dataset. The overall description of
                         arly articles on SARS, MERS, and COVID-19, comprising                                                               sub-graphs relevant to each pandemic is given in Table 1.
                         9,142 nodes and 81,707 connections. We conducted a valida-                                                             The pandemic graph is denoted as 𝐺 = (𝑉, 𝐸), and
                         tion test to assess how well various mainstream techniques
                                                                                                                                                                 V = \{V_{dis}, V_{dg}, V_{g}\}                 (1)
                         Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities
                         from Scientific Documents and the 4th AI + Informetrics (EEKE-AII2024),
                                                                                                                                                 E = \{E_{dg}^{dg}, E_{dg}^{dis}, E_{dg}^{g}, E_{dis}^{dis}, E_{dis}^{g}, E_{g}^{g}\}      (2)
                         April 23-24, 2024, Changchun, China and Online
                         *
                           Corresponding author.                                                                                                where 𝑉𝑑𝑖𝑠 , 𝑉𝑑𝑔 , and 𝑉𝑔 respectively represent the node
                         mengjia.wu@uts.edu.au (M. Wu); yuch25@mail3.sysu.edu.cn                                                             set of diseases, drugs and genes. E_j^i (i, j ∈ {dis, dg, g})
                         (C. Yu); issxj@mail.sysu.edu.cn (J. Xu); ying.ding@ischool.utexas.edu                                               denotes the edge set of associations between nodes of types
                         (Y. Ding); yi.zhang@uts.edu.au (Y. Zhang)                                                                           𝑖 and 𝑗. Entity association inference on this pandemic graph
                          0000-0003-3956-7808 (M. Wu); 0000-0003-4886-4708 (J. Xu);
                         0000-0003-2567-2009 (Y. Ding); 0000-0002-7731-0301 (Y. Zhang)
                                                                                                                                             aims to predict emerging associations between nodes in 𝑉
                                   © 2024 Copyright 2024 for this paper by its authors. Use permitted under Creative Commons License         that have not yet appeared in 𝐸.
                                   Attribution 4.0 International (CC BY 4.0).
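
The typed graph G = (V, E) defined above can be sketched as a pair of dictionaries keyed by entity type. This is a minimal illustration under stated assumptions, not the authors' implementation: the entity identifiers (dg1, dis1, g1, ...) and the helper candidate_pairs are hypothetical, edges are treated as undirected, and real identifiers would come from iBKH.

```python
from itertools import product

# Hypothetical toy entities; real node identifiers would come from iBKH.
V = {
    "dis": {"dis1", "dis2"},
    "dg": {"dg1", "dg2"},
    "g": {"g1"},
}

# Edge sets keyed by (type, type); each edge is a (node, node) tuple.
E = {
    ("dg", "dis"): {("dg1", "dis1")},
    ("dis", "g"): {("dis1", "g1"), ("dis2", "g1")},
}


def candidate_pairs(V, E, ti, tj):
    """Return node pairs of types (ti, tj) that have no edge in E yet."""
    known = {frozenset(edge) for edge in E.get((ti, tj), set())}
    pairs = set()
    for u, v in product(sorted(V[ti]), sorted(V[tj])):
        if u != v and frozenset((u, v)) not in known:
            # Same-type pairs are undirected here, so store them in sorted order.
            pairs.add(tuple(sorted((u, v))) if ti == tj else (u, v))
    return pairs


print(sorted(candidate_pairs(V, E, "dg", "dis")))
```

The pairs returned by candidate_pairs are the node pairs in V that do not yet appear in E, which is the search space for association inference on this toy graph.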


CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073

    Table 1
    The basic information of pandemic knowledge graphs

                              SARS       MERS   COVID-19   Pandemic graph
    #Papers                  9,991      1,494    281,569          293,054
    Nodes
      Drug                     439         46      1,429            1,507
      Disease                  522         94      1,841            1,814
      Gene                   1,939        345      5,435            5,821
    Edges
      Drug-drug                145         13      1,626            1,678
      Drug-disease             951         59      9,085            9,381
      Drug-gene                710         49      3,709            4,135
      Disease-disease          148         26        928              939
      Disease-gene           6,256        575     54,503           57,236
      Gene-gene              2,199        347      7,461            8,338
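
As a quick consistency check, the node and edge counts in the "Pandemic graph" column of Table 1 can be summed and compared with the totals quoted in the Introduction (9,142 nodes and 81,707 connections). The dictionary names below are illustrative only.

```python
# Counts copied from the "Pandemic graph" column of Table 1.
nodes = {"drug": 1507, "disease": 1814, "gene": 5821}
edges = {
    ("drug", "drug"): 1678,
    ("drug", "disease"): 9381,
    ("drug", "gene"): 4135,
    ("disease", "disease"): 939,
    ("disease", "gene"): 57236,
    ("gene", "gene"): 8338,
}

total_nodes = sum(nodes.values())
total_edges = sum(edges.values())
print(total_nodes, total_edges)  # 9142 81707
```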


   There have been substantial efforts in the development                     geneous graph representation, which utilizes con-
of association inference methodologies. In this study, we                     trastive learning to derive node representations.
selected the following representative methods for comparison:

     • Random Walk with Restart (RWR) is a commonly                    3. Experiment settings
       used method for inferring relationships within
       graphs, particularly in the biomedical field. It mod-           The setup for the experiment is detailed in Figure 1. The
       els a random walking process that begins at node 𝑎              objective was to assess the efficacy of various algorithms
       and calculates the likelihood of reaching node 𝑏 as             in predicting associations between biomedical entities. To
       a measure of relevance between nodes 𝑎 and 𝑏. To                this end, a validation experiment was structured in the
       prevent the walk from becoming trapped in local areas,          following manner: From each category of edges, denoted
       it introduces a restart probability 𝑝, which allows             as E_j^i where i, j ∈ {dis, dg, g} (representing disease,
       the walk to restart from node 𝑎 at each step, thereby           drug, and gene, respectively), 10% of the edges were
       ensuring broader exploration of the graph.                      randomly selected and removed. The resulting graph,
                                                                       with these edges removed, was labeled as
     • Resource allocation (RA) [10]: RA is a link predic-
                                                                       G_m = (V, E_m). The edges that were removed are
       tion algorithm that conceptualizes the graph as a
                                                                       represented by rE = {rE_j^i | i, j ∈ {dis, dg, g}}, and these were
       transportation network, viewing edges as channels
                                                                       considered the ’true’ associations for the purposes of this
       for resource diffusion. Under this model, the likeli-
                                                                       experiment. In addition to this, an equivalent number of
       hood of forming a link between any two nodes is
                                                                       node pairs, which were not connected by edges in the original
       approximated by the total resources these nodes are
                                                                       graph G, were randomly chosen. These pairs are denoted by
       expected to receive through their shared neighbors.
                                                                       nE = {nE_j^i | i, j ∈ {dis, dg, g}, nE_j^i ∩ E_j^i = ∅},
       This approach leverages the idea that the more re-
                                                                       and they were defined as the negative sample set for this
       sources two nodes can exchange via their common
                                                                       study. This methodical approach enabled a balanced evalua-
       connections, the higher the probability they will
                                                                       tion, comparing the algorithms’ abilities to correctly infer
       establish a direct link.
                                                                       both existing and non-existing associations, thereby provid-
     • Node2Vec [11]: Node2Vec is a scalable graph rep-
                                                                       ing a comprehensive understanding of their performance in
       resentation technique that utilizes random walks
                                                                       the context of biomedical entity association inference.
       to learn low-dimensional vector representations of
                                                                          Subsequently, each candidate algorithm was applied to
       nodes within a graph. It operates by optimizing an
                                                                       the modified graph 𝐺𝑚 to ascertain the likelihood of edge
       objective that aims to preserve neighborhood rela-
                                                                       formation between every pair of nodes within both 𝑟𝐸
       tionships, ensuring that nodes with similar network
                                                                       and 𝑛𝐸. In the cases of the Random Walk with Restart
       neighborhoods are close to each other in the vector
                                                                       (RWR) and resource allocation algorithms, this procedure
       space.
                                                                       involved computing the random walk probability and the
     • Heterogeneous graph neural networks (HetGNN)                    resource allocation score, respectively, for each node pair.
       [12]: HetGNN is a graph representation technique                Conversely, for the three graph representation techniques,
       designed to work with heterogeneous graphs, char-               the process entailed converting every node in the set 𝑉 into
       acterized by their inclusion of various types of nodes,         embedding vectors. The representation for edges was then
       each possessing diverse content attributes such as              determined through an average pooling strategy, which
       text and images. It introduces a novel two-step in-             involves aggregating the features of node embeddings to
       formation aggregation process aimed at effectively              form a single representation for each edge.
       learning from the information presented by neigh-                  Following the generation of these probabilities or repre-
       boring nodes, both of the same and different types.             sentations, the combined dataset of 𝑟𝐸 and 𝑛𝐸 was divided,
       This process allows HetGNN to capture the complex               with 80% allocated for training and the remaining 20% for
       structural and content heterogeneity of the graph,              testing. This division was employed to train a logistic re-
       enabling the model to generate more accurate and                gression classifier, the purpose of which was to predict the
       meaningful representations of each node.                        likelihood of edge formation between node pairs in the test
     • Heterogeneous graph neural network with co-                     set. The predictions made by the logistic regression model
       contrastive learning (HeCo) [13]: HeCo is a self-               were then used to calculate the Area Under the Curve (AUC)
       supervised learning technique designed for hetero-              metric for each method. By focusing exclusively on the test


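
The two score-based baselines from the method list, RWR and RA, can be sketched in plain Python as follows. This is an illustration on a hypothetical four-node toy graph, not the authors' code: the restart probability of 0.3, the iteration count, and the node labels are all assumptions, since the paper does not report these settings.

```python
def resource_allocation(adj, x, y):
    """RA index: sum of 1/degree over the common neighbours of x and y."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])


def random_walk_with_restart(adj, start, restart=0.3, iters=100):
    """Iterate p <- (1 - restart) * W p + restart * e_start.

    Assumes an undirected graph in which every node has at least one neighbour.
    """
    p = {v: 0.0 for v in adj}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for u, mass in p.items():
            # Each node spreads its non-restarting mass evenly over its neighbours.
            share = (1.0 - restart) * mass / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        nxt[start] += restart  # the restarting mass jumps back to the start node
        p = nxt
    return p


# Hypothetical toy graph (adjacency sets); labels are illustrative only.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}

ra_score = resource_allocation(adj, "a", "d")  # common neighbours are b and c
rwr = random_walk_with_restart(adj, "a")
print(round(ra_score, 4), {v: round(s, 4) for v, s in sorted(rwr.items())})
```

In this sketch the RWR relevance of node b to node a is read off as rwr["b"], and the RA score of the unlinked pair (a, d) is the sum of the inverse degrees of their shared neighbours.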
         Figure 1: The overall experiment design
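
The first stage of the design in Figure 1, masking 10% of the edges of each type and sampling an equal number of unconnected node pairs, can be sketched as follows. The function names, the seed, the rounding rule, and the toy drug and gene identifiers are assumptions made for illustration; the paper does not specify them.

```python
import random


def mask_edges(edges_by_type, frac=0.1, seed=0):
    """Randomly remove a fraction of the edges of each type."""
    rng = random.Random(seed)
    kept, removed = {}, {}
    for etype, edges in edges_by_type.items():
        shuffled = sorted(edges)
        rng.shuffle(shuffled)
        k = max(1, round(frac * len(shuffled)))  # rounding rule is an assumption
        removed[etype] = shuffled[:k]
        kept[etype] = shuffled[k:]
    return kept, removed


def sample_negatives(nodes_by_type, edges_by_type, removed, seed=0):
    """Draw as many unconnected node pairs per type as there are removed edges.

    Assumes enough unconnected pairs exist; otherwise the loop would not end.
    """
    rng = random.Random(seed)
    negatives = {}
    for (ti, tj), positives in removed.items():
        existing = {frozenset(e) for e in edges_by_type[(ti, tj)]}
        chosen = {}
        while len(chosen) < len(positives):
            u, v = rng.choice(nodes_by_type[ti]), rng.choice(nodes_by_type[tj])
            key = frozenset((u, v))
            if u != v and key not in existing:
                chosen[key] = (u, v)
        negatives[(ti, tj)] = list(chosen.values())
    return negatives


# Hypothetical toy data: 4 drugs, 5 genes, 10 drug-gene edges.
nodes = {"dg": ["dg1", "dg2", "dg3", "dg4"], "g": ["g1", "g2", "g3", "g4", "g5"]}
edges = {("dg", "g"): [(d, g) for d in nodes["dg"][:2] for g in nodes["g"]]}

kept, r_e = mask_edges(edges)              # kept edges form G_m, r_e plays the role of rE
n_e = sample_negatives(nodes, edges, r_e)  # n_e plays the role of nE
print(len(kept[("dg", "g")]), r_e, n_e)
```

Negatives are checked against the original edge set rather than the masked one, so a removed true edge can never be drawn as a negative sample.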


Table 2                                                               training mechanisms being specifically designed for hetero-
Performance comparison of selected algorithms                         geneous networks, as seen in this research and commonly
                                                                      in biomedical entity graphs. These methods incorporate
     Edge type       RWR      RA       Node2Vec   HeCo     HetGNN     the significance of node types into the computation, em-
     E_{dg}^{dg}     0.5827   0.5830   0.7257     -        0.9566     ploying either type-specific or metapath-based information
     E_{dg}^{dis}    0.7081   0.7651   0.8079     -        0.8315     aggregation strategies. While this heterogeneity-focused
     E_{dg}^{g}      0.8298   0.8741   0.9250     0.9120   0.9584     approach is beneficial, it limits the model’s applicability
     E_{dis}^{dis}   0.7585   0.7893   0.7086     -        0.8495     and increases the cost of adaptation. Changes in the het-
     E_{dis}^{g}     0.5327   0.5410   0.7802     0.7990   0.8001     erogeneous graph’s structure necessitate adjustments to
     E_{g}^{g}       0.7561   0.8110   0.8327     0.8530   0.9050     HetGNN’s data inputs and HeCo’s metapaths, along with
                                                                      significant methodological revisions. Additionally, HeCo’s
                                                                      performance is influenced by the setting of a positive sample
                                                                      threshold and the definition of metapaths, which vary per
                                                                      case and affect the outcome significantly. Node2Vec, in
data, which comprised 20% of the total dataset, a standard-           contrast, offers a more generalized solution applicable to a
ized evaluation criterion was established. This approach              wide range of graph types.
allowed for a fair comparison of the five candidate methods,             In conclusion, while heterogeneous graph representation
with the AUC metric serving as a measure of each method’s             methods hold promise for deducing relationships within
ability to accurately classify node pairs as either connected         pandemic knowledge graphs, enhancing their flexibility and
or not connected, based on the generated classification prob-         general applicability remains a challenge.
abilities.
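
The classification and scoring stage described in this section (average pooling of node embeddings, an 80/20 split, a logistic regression classifier, and AUC on the test portion) can be sketched in plain Python as follows. This is a simplified illustration, not the authors' pipeline: the 2-D embeddings are random stand-ins for Node2Vec, HetGNN or HeCo outputs, the learning rate and epoch count are arbitrary, and the split is stratified by class as an assumption so that both labels appear in the test set.

```python
import math
import random


def average_pool(emb_u, emb_v):
    """Edge representation as the element-wise mean of two node embeddings."""
    return [(a + b) / 2.0 for a, b in zip(emb_u, emb_v)]


def fit_logistic(X, y, lr=0.1, epochs=200):
    """Plain stochastic-gradient logistic regression; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = yi - 1.0 / (1.0 + math.exp(-z))
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
            b += lr * err
    return w, b


def predict(w, b, X):
    """Predicted probability of an edge for each feature vector in X."""
    scores = []
    for xi in X:
        z = sum(wj * xj for wj, xj in zip(w, xi)) + b
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores


def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


rng = random.Random(0)


def toy_pair(centre):
    """Two hypothetical 2-D node embeddings drawn around a common centre."""
    return ([rng.gauss(centre, 0.5) for _ in range(2)],
            [rng.gauss(centre, 0.5) for _ in range(2)])


pos = [average_pool(*toy_pair(1.0)) for _ in range(10)]   # stands in for rE
neg = [average_pool(*toy_pair(-1.0)) for _ in range(10)]  # stands in for nE

# Stratified 80/20 split so that both classes appear in the test set.
X_train, y_train = pos[:8] + neg[:8], [1] * 8 + [0] * 8
X_test, y_test = pos[8:] + neg[8:], [1] * 2 + [0] * 2

w, b = fit_logistic(X_train, y_train)
test_auc = auc(predict(w, b, X_test), y_test)
print(round(test_auc, 3))
```

For RWR and RA, the same classifier would take the single random-walk probability or resource-allocation score of each node pair as its one-dimensional feature instead of a pooled embedding.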

                                                                      5. Discussion and Conclusions
4. Results
                                                                      This study explores the performance of different methods of
Table 2 presents the AUC scores for the five candidate meth-          association inference and provides insights into the poten-
ods. It is noted that HeCo needs a metapath definition to             tial of graph representation methods. Despite some existing
function, and a gene-based metapath was chosen for this               entity-relationship summarization tools like PubTator 3 [14],
purpose. Consequently, HeCo’s evaluation was limited to               graph representation methods still hold the potential to infer
gene-related associations. HetGNN outperformed the other              graph representation methods still hold the potential to infer
methods in recovering the removed links. Compared                     more accurate biomedical associations but need improve-
to RWR and RA, the three graph representation methods                 ment in adaptability and generalisability. Future work will
demonstrated better accuracy in identifying connections.              association inference on the built pandemic graph.
Yet, their advantage is not definitive because they utilize a            We anticipate the following future directions, which
supervised learning approach, requiring both positive and             align with some limitations of the current study: 1) This
negative samples to train a classifier, whereas RWR and               study offered some preliminary understandings on selected
RA can be applied directly to the existing graph structure            baselines of graph representation learning in inferring the
without any pre-existing knowledge of it.                             pandemic knowledge graph, but further customized re-
   From the perspective of edge types, the analysis of gene-          development based on the unique features of the pandemic
drug and drug-drug connections showed superior outcomes.              knowledge graph to enhance its performance might be ben-
Importantly, both RWR and RA displayed similar levels of              eficial. 2) Investigating the scientific community of a pan-
effectiveness as graph representation techniques in the task          demic and its collaborative patterns will bring insights to
of deducing disease-disease associations. This suggests that          analyze the societal context of a pandemic crisis and provide
inferring disease similarities might be distinct from other           evidence-based decision support in terms of science policy,
tasks, meriting additional investigation.                             public health, and public administration.
   Among the graph representation strategies, two methods
tailored for heterogeneous networks achieved superior AUC
scores over Node2Vec. This superiority results from their


Acknowledgments

This work was supported by the Commonwealth Scientific
and Industrial Research Organization (CSIRO), Australia, in
conjunction with the National Science Foundation (NSF) of
the United States, under CSIRO-NSF #2303037.


References
 [1] S. Henry, B. T. McInnes, Literature based discovery:
     Models, methods, and trends, Journal of Biomedical
     Informatics 74 (2017) 20–32.
 [2] M. Wu, Y. Zhang, G. Zhang, J. Lu, Exploring the genetic
     basis of diseases through a heterogeneous bibliometric
     network: A methodology and case study, Technological
     Forecasting and Social Change 164 (2021) 120513.
 [3] M. Haghani, M. C. Bliemer, COVID-19 pandemic and
     the unprecedented mobilisation of scholarly efforts
     prompted by a health crisis: Scientometric comparisons
     across SARS, MERS and 2019-nCoV literature,
     Scientometrics 125 (2020) 2695–2726.
 [4] Y. Zhang, X. Cai, C. V. Fry, M. Wu, C. S. Wagner, Topic
     evolution, disruption and resilience in early COVID-19
     research, Scientometrics 126 (2021) 4225–4253.
 [5] A. L. Porter, Y. Zhang, Y. Huang, M. Wu, Tracking and
     mining the COVID-19 research literature, Frontiers in
     Research Metrics and Analytics 5 (2020) 594060.
 [6] M. Wu, Y. Zhang, M. Markley, C. Cassidy, N. Newman,
     A. Porter, COVID-19 knowledge deconstruction and
     retrieval: An intelligent bibliometric solution,
     Scientometrics (2023) 1–31.
 [7] M. Wu, Y. Zhang, M. Grosser, S. Tipper, D. Venter,
     H. Lin, J. Lu, Profiling COVID-19 genetic research: A
     data-driven study utilizing intelligent bibliometrics,
     Frontiers in Research Metrics and Analytics 6 (2021)
     683212.
 [8] K. Guo, M. Wu, Z. Soo, Y. Yang, Y. Zhang, Q. Zhang,
     H. Lin, M. Grosser, D. Venter, G. Zhang, et al., Artificial
     intelligence-driven biomedical genomics,
     Knowledge-Based Systems (2023) 110937.
 [9] C. Su, Y. Hou, M. Zhou, S. Rajendran, J. R. Maasch,
     Z. Abedi, H. Zhang, Z. Bai, A. Cuturrufo, W. Guo,
     et al., Biomedical discovery through the integrative
     biomedical knowledge hub (iBKH), iScience 26 (2023).
[10] T. Zhou, L. Lü, Y.-C. Zhang, Predicting missing links
     via local information, The European Physical Journal
     B 71 (2009) 623–630.
[11] A. Grover, J. Leskovec, node2vec: Scalable feature
     learning for networks, in: Proceedings of the 22nd
     ACM SIGKDD International Conference on Knowledge
     Discovery and Data Mining, 2016, pp. 855–864.
[12] C. Zhang, D. Song, C. Huang, A. Swami, N. V. Chawla,
     Heterogeneous graph neural network, in: Proceedings
     of the 25th ACM SIGKDD International Conference
     on Knowledge Discovery & Data Mining, 2019, pp.
     793–803.
[13] X. Wang, N. Liu, H. Han, C. Shi, Self-supervised
     heterogeneous graph neural network with co-contrastive
     learning, in: Proceedings of the 27th ACM SIGKDD
     Conference on Knowledge Discovery & Data Mining,
     2021, pp. 1726–1736.
[14] C.-H. Wei, A. Allot, P.-T. Lai, R. Leaman, S. Tian, L. Luo,
     Q. Jin, Z. Wang, Q. Chen, Z. Lu, PubTator 3.0: An
     AI-powered literature resource for unlocking biomedical
     knowledge, arXiv preprint arXiv:2401.11048 (2024).

</pre>