=Paper=
{{Paper
|id=Vol-3745/paper19
|storemode=property
|title=Biomedical Association Inference on Pandemic Knowledge Graphs: A Comparative Study
|pdfUrl=https://ceur-ws.org/Vol-3745/paper19.pdf
|volume=Vol-3745
|authors=Mengjia Wu,Chao Yu,Jian Xu,Ying Ding,Yi Zhang
|dblpUrl=https://dblp.org/rec/conf/eeke/WuYX0Z24
}}
==Biomedical Association Inference on Pandemic Knowledge Graphs: A Comparative Study==
Mengjia Wu<sup>1,*</sup>, Chao Yu<sup>2</sup>, Jian Xu<sup>2</sup>, Ying Ding<sup>3</sup> and Yi Zhang<sup>1</sup>

<sup>1</sup> Australian Artificial Intelligence Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, 15 Broadway, Ultimo, NSW, Australia<br />
<sup>2</sup> School of Information Management, Sun Yat-sen University, Guangzhou, China<br />
<sup>3</sup> School of Information, University of Texas, Austin, TX, USA

Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 4th AI + Informetrics (EEKE-AII2024), April 23-24, 2024, Changchun, China and Online

<sup>*</sup> Corresponding author.<br />
Email: mengjia.wu@uts.edu.au (M. Wu); yuch25@mail3.sysu.edu.cn (C. Yu); issxj@mail.sysu.edu.cn (J. Xu); ying.ding@ischool.utexas.edu (Y. Ding); yi.zhang@uts.edu.au (Y. Zhang)<br />
ORCID: 0000-0003-3956-7808 (M. Wu); 0000-0003-4886-4708 (J. Xu); 0000-0003-2567-2009 (Y. Ding); 0000-0002-7731-0301 (Y. Zhang)

© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

===Abstract===
Acquiring insights and understanding from historical pandemics is crucial for reducing the likelihood of their recurrence. Knowledge graphs are an essential tool for researchers, and knowledge inference is a prominent task on these graphs: it deduces previously unidentified connections between entities. This study constructs a knowledge graph centered on pandemic research and evaluates several mainstream methods for biomedical association inference. Our findings indicate that graph representation techniques hold significant promise for these tasks and that heterogeneous graph representation techniques achieve high predictive accuracy. Nonetheless, advancing this area of research requires more refined experimental designs and more adaptive learning strategies.

'''Keywords:''' Biomedical knowledge graph, graph representation, knowledge inference

===1. Introduction===
Biomedical entity association inference is a long-term task for scientific researchers and industry practitioners, who use it to understand the relationships between biomedical entities and to propose first-hand literature-based evidence for further investigations [1, 2]. Severe Acute Respiratory Syndrome (SARS), Middle East Respiratory Syndrome (MERS) and Coronavirus Disease 2019 (COVID-19), three notorious pandemics in public health history, presented huge threats to human lives and social stability [3, 4]. Inferring knowledge from the large body of coronavirus-related research articles may offer insights into the evolutionary mechanisms of coronaviruses, reducing public uncertainty and informing precautions for future infectious disease crises [5, 6]. However, the complexity, heterogeneity and intricate associations of biomedical entities make it challenging to explore newly emerging knowledge.

Knowledge graphs, which are extensively used to depict intricate data relationships, serve as the foundation for analyzing and inferring associations [2, 6, 7, 8]. These graphs represent biomedical entities such as genes, diseases, chemicals, and drugs as nodes, with their relationships illustrated as either directed or undirected edges, sometimes accompanied by supplementary descriptive attributes. Leveraging network analysis techniques, various methods have been introduced to investigate patterns of association and predict previously unknown relationships.

In this study, we developed a knowledge graph from scholarly articles on SARS, MERS, and COVID-19, comprising 9,142 nodes and 81,707 connections. We conducted a validation test to assess how well various mainstream techniques could predict relationships within this graph. After masking 10% of the connections of each type, we applied five different methods to the masked graph to distinguish the hidden connections from an equal number of randomly inserted non-existent connections. The methods differed in how well they identified the obscured connections, with HetGNN proving the most effective. Nonetheless, the flexibility and applicability of graph representation methods across varied contexts need enhancement. This research illustrates the application of multiple prominent methods to deducing associations in knowledge graphs and compares the accuracy of these methods.

The remainder of this paper is organized as follows. The Data and Method section introduces the pandemic knowledge graph and the examined methods, followed by Experiment settings and Results. The Discussion and Conclusions section concludes the study and outlines some future directions.

===2. Data and Method===
The integrative Biomedical Knowledge Hub (iBKH) is a knowledge graph dataset that curates the associations of 11 categories of biomedical entities from 17 publicly available data sources [9]. Using the iBKH as the global dataset, we searched scholarly articles across PubMed using search strategies from [3] and cross-matched the search results to iBKH. By extracting the nodes and edges relevant to papers in the search results, we constructed a pandemic-specific sub-graph of the iBKH dataset. The overall description of the sub-graphs relevant to each pandemic is given in Table 1.

{| class="wikitable"
|+ Table 1: Basic information of the pandemic knowledge graphs
! Item !! SARS !! MERS !! COVID-19 !! Pandemic graph
|-
| Papers || 9,991 || 1,494 || 281,569 || 293,054
|-
| Drug (nodes) || 439 || 46 || 1,429 || 1,507
|-
| Disease (nodes) || 522 || 94 || 1,841 || 1,814
|-
| Gene (nodes) || 1,939 || 345 || 5,435 || 5,821
|-
| Drug-drug (edges) || 145 || 13 || 1,626 || 1,678
|-
| Drug-disease (edges) || 951 || 59 || 9,085 || 9,381
|-
| Drug-gene (edges) || 710 || 49 || 3,709 || 4,135
|-
| Disease-disease (edges) || 148 || 26 || 928 || 939
|-
| Disease-gene (edges) || 6,256 || 575 || 54,503 || 57,236
|-
| Gene-gene (edges) || 2,199 || 347 || 7,461 || 8,338
|}

The pandemic graph is denoted as <math>G = (V, E)</math>, and

:<math>V = \{V_{dis}, V_{dr}, V_{g}\} \qquad (1)</math>

:<math>E = \{E_{dr}^{dr}, E_{dr}^{dis}, E_{dr}^{g}, E_{dis}^{dis}, E_{dis}^{g}, E_{g}^{g}\} \qquad (2)</math>

where <math>V_{dis}</math>, <math>V_{dr}</math>, and <math>V_{g}</math> respectively represent the node sets of diseases, drugs and genes. <math>E_{i}^{j}</math> (<math>i, j \in \{dis, dr, g\}</math>) denotes the edge set of associations between nodes of types <math>i</math> and <math>j</math>. Entity association inference on this pandemic graph aims to predict emerging associations between nodes in <math>V</math> that have not yet appeared in <math>E</math>.

There have been substantial efforts in the development of association inference methodologies. In this study, we selected the following representative methods for the experiment:

* Random Walk with Restart (RWR): RWR is a commonly used method for inferring relationships within graphs, particularly in the biomedical field. It models a random walk that begins at a source node and takes the likelihood of reaching a target node as a measure of relevance between the two nodes. To keep the walk from becoming trapped in local areas, it introduces a restart probability, which allows the walk to restart from the source node at each step, thereby ensuring broader exploration of the graph.
* Resource allocation (RA) [10]: RA is a link prediction algorithm that conceptualizes the graph as a transportation network, viewing edges as channels for resource diffusion. Under this model, the likelihood of a link forming between two nodes is approximated by the total resources these nodes are expected to receive through their shared neighbors. The more resources two nodes can exchange via their common connections, the higher the probability that they will establish a direct link.
* Node2Vec [11]: Node2Vec is a scalable graph representation technique that uses random walks to learn low-dimensional vector representations of nodes within a graph. It optimizes an objective that preserves neighborhood relationships, ensuring that nodes with similar network neighborhoods are close to each other in the vector space.
* Heterogeneous graph neural networks (HetGNN) [12]: HetGNN is a graph representation technique designed for heterogeneous graphs, which contain various types of nodes, each possessing diverse content attributes such as text and images. It introduces a two-step information aggregation process for learning from neighboring nodes of both the same and different types. This process allows HetGNN to capture the structural and content heterogeneity of the graph and to generate more accurate and meaningful representations of each node.
* Heterogeneous graph neural network with co-contrastive learning (HeCo) [13]: HeCo is a self-supervised learning technique designed for heterogeneous graph representation, which uses contrastive learning to derive node representations.

===3. Experiment settings===
The setup for the experiment is detailed in Figure 1. The objective was to assess the efficacy of various algorithms in predicting associations between biomedical entities. To this end, a validation experiment was structured as follows. From each category of edges <math>E_{i}^{j}</math>, where <math>i, j \in \{dis, dr, g\}</math> (representing disease, drug, and gene respectively), 10% of the edges were randomly selected and removed. The resulting graph, with these edges removed, was labeled as <math>G_{m} = (V, E_{m})</math>. The removed edges are represented by <math>PE = \{PE_{i}^{j} \mid i, j \in \{dis, dr, g\}\}</math>, and these were considered the "true" associations for the purposes of this experiment. In addition, an equal number of node pairs that were not connected by edges in the original graph <math>G</math> were randomly chosen. These pairs are denoted by <math>NE = \{NE_{i}^{j} \mid i, j \in \{dis, dr, g\},\ NE_{i}^{j} \cap E_{i}^{j} = \emptyset\}</math>, and they were defined as the negative sample set for this study. This design enabled a balanced evaluation of the algorithms' abilities to correctly infer both existing and non-existing associations.

Figure 1: The overall experiment design

Subsequently, each candidate algorithm was applied to the modified graph <math>G_{m}</math> to ascertain the likelihood of edge formation between every pair of nodes within both <math>PE</math> and <math>NE</math>. For the RWR and resource allocation algorithms, this procedure involved computing the random walk probability and the resource allocation score, respectively, for each node pair. For the three graph representation techniques, every node in the set <math>V</math> was converted into an embedding vector. The representation of each edge was then obtained through an average pooling strategy, which aggregates the embeddings of the edge's two end nodes into a single representation.

Following the generation of these probabilities or representations, the combined dataset of <math>PE</math> and <math>NE</math> was divided, with 80% allocated for training and the remaining 20% for testing. The training portion was used to fit a logistic regression classifier that predicts the likelihood of edge formation between node pairs in the test set. The predictions made by the logistic regression model were then used to calculate the Area Under the Curve (AUC) metric for each method. Evaluating every method on the same held-out 20% provided a standardized criterion and a fair comparison of the five candidate methods, with the AUC measuring each method's ability to classify node pairs as connected or not connected based on the generated classification probabilities.

===4. Results===
Table 2 presents the AUC scores for the five candidate methods. HeCo needs a metapath definition to function, and a gene-based metapath was chosen for this purpose. Consequently, HeCo's evaluation was limited to gene-related associations.

{| class="wikitable"
|+ Table 2: Performance comparison of selected algorithms (AUC)
! Edge type !! RWR !! RA !! Node2Vec !! HeCo !! HetGNN
|-
| <math>E_{dr}^{dr}</math> (drug-drug) || 0.5827 || 0.5830 || 0.7257 || - || 0.9566
|-
| <math>E_{dr}^{dis}</math> (drug-disease) || 0.7081 || 0.7651 || 0.8079 || - || 0.8315
|-
| <math>E_{dr}^{g}</math> (drug-gene) || 0.8298 || 0.8741 || 0.9250 || 0.9120 || 0.9584
|-
| <math>E_{dis}^{dis}</math> (disease-disease) || 0.7585 || 0.7893 || 0.7086 || - || 0.8495
|-
| <math>E_{dis}^{g}</math> (disease-gene) || 0.5327 || 0.5410 || 0.7802 || 0.7990 || 0.8001
|-
| <math>E_{g}^{g}</math> (gene-gene) || 0.7561 || 0.8110 || 0.8327 || 0.8530 || 0.9050
|}

HetGNN outperformed the other methods in recovering the removed links. Compared to RWR and RA, the three graph representation methods demonstrated better accuracy in identifying connections. Yet their advantage is not definitive, because they rely on a supervised learning approach that requires both positive and negative samples to train a classifier, whereas RWR and RA can be applied directly to the existing graph structure without any prior training.

From the perspective of edge types, drug-gene and drug-drug connections showed superior outcomes: every method reached its highest AUC on drug-gene edges, and HetGNN reached 0.9566 on drug-drug edges. Importantly, both RWR and RA (0.7585 and 0.7893) performed at a level similar to the graph representation techniques (0.7086 for Node2Vec and 0.8495 for HetGNN) in deducing disease-disease associations. This suggests that inferring disease similarities might be distinct from other tasks, meriting additional investigation.

Among the graph representation strategies, the two methods tailored for heterogeneous networks generally achieved higher AUC scores than Node2Vec; the exception in Table 2 is HeCo on drug-gene edges (0.9120 against 0.9250). Their advantage results from training mechanisms that are specifically designed for heterogeneous networks, as seen in this research and commonly in biomedical entity graphs. These methods incorporate node types into the computation, employing either type-specific or metapath-based strategies for aggregating information. While this heterogeneity-focused approach is beneficial, it limits the model's applicability and increases the cost of adaptation. Changes in the heterogeneous graph's structure necessitate adjustments to HetGNN's data inputs and HeCo's metapaths, along with significant methodological revisions. Additionally, HeCo's performance is influenced by the setting of a positive sample threshold and the definition of metapaths, which vary per case and affect the outcome significantly. Node2Vec, in contrast, offers a more generalized solution applicable to a wide range of graph types.

In conclusion, while heterogeneous graph representation methods hold promise for deducing relationships within pandemic knowledge graphs, enhancing their flexibility and general applicability remains a challenge.

===5. Discussion and Conclusions===
This study explores the performance of different methods of association inference and provides insights into the potential of graph representation methods. Despite the existence of entity-relationship summarization tools such as PubTator 3 [14], graph representation methods still hold the potential to infer more accurate biomedical associations, but they need improvement in adaptability and generalisability. Future work will modify the inference framework and perform real-world association inference on the built pandemic graph.

We anticipate the following future directions, which align with some limitations of the current study: 1) This study offers a preliminary understanding of selected graph representation learning baselines for inference on the pandemic knowledge graph; further customized re-development based on the unique features of the pandemic knowledge graph might improve performance. 2) Investigating the scientific community of a pandemic and its collaborative patterns will bring insights into the societal context of a pandemic crisis and provide evidence-based decision support for science policy, public health, and public administration.

===Acknowledgments===
This work was supported by the Commonwealth Scientific and Industrial Research Organization (CSIRO), Australia, in conjunction with the National Science Foundation (NSF) of the United States, under CSIRO-NSF #2303037.

===References===
[1] S. Henry, B. T. McInnes, Literature based discovery: Models, methods, and trends, Journal of Biomedical Informatics 74 (2017) 20–32.

[2] M. Wu, Y. Zhang, G. Zhang, J. Lu, Exploring the genetic basis of diseases through a heterogeneous bibliometric network: A methodology and case study, Technological Forecasting and Social Change 164 (2021) 120513.

[3] M. Haghani, M. C. Bliemer, Covid-19 pandemic and the unprecedented mobilisation of scholarly efforts prompted by a health crisis: Scientometric comparisons across sars, mers and 2019-ncov literature, Scientometrics 125 (2020) 2695–2726.

[4] Y. Zhang, X. Cai, C. V. Fry, M. Wu, C. S. Wagner, Topic evolution, disruption and resilience in early covid-19 research, Scientometrics 126 (2021) 4225–4253.

[5] A. L. Porter, Y. Zhang, Y. Huang, M. Wu, Tracking and mining the covid-19 research literature, Frontiers in Research Metrics and Analytics 5 (2020) 594060.

[6] M. Wu, Y. Zhang, M. Markley, C. Cassidy, N. Newman, A. Porter, Covid-19 knowledge deconstruction and retrieval: An intelligent bibliometric solution, Scientometrics (2023) 1–31.

[7] M. Wu, Y. Zhang, M. Grosser, S. Tipper, D. Venter, H. Lin, J. Lu, Profiling covid-19 genetic research: A data-driven study utilizing intelligent bibliometrics, Frontiers in Research Metrics and Analytics 6 (2021) 683212.

[8] K. Guo, M. Wu, Z. Soo, Y. Yang, Y. Zhang, Q. Zhang, H. Lin, M. Grosser, D. Venter, G. Zhang, et al., Artificial intelligence-driven biomedical genomics, Knowledge-Based Systems (2023) 110937.

[9] C. Su, Y. Hou, M. Zhou, S. Rajendran, J. R. Maasch, Z. Abedi, H. Zhang, Z. Bai, A. Cuturrufo, W. Guo, et al., Biomedical discovery through the integrative biomedical knowledge hub (ibkh), Iscience 26 (2023).

[10] T. Zhou, L. Lü, Y.-C. Zhang, Predicting missing links via local information, The European Physical Journal B 71 (2009) 623–630.

[11] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 855–864.

[12] C. Zhang, D. Song, C. Huang, A. Swami, N. V. Chawla, Heterogeneous graph neural network, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 793–803.

[13] X. Wang, N. Liu, H. Han, C. Shi, Self-supervised heterogeneous graph neural network with co-contrastive learning, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 1726–1736.

[14] C.-H. Wei, A. Allot, P.-T. Lai, R. Leaman, S. Tian, L. Luo, Q. Jin, Z. Wang, Q. Chen, Z. Lu, Pubtator 3.0: An ai-powered literature resource for unlocking biomedical knowledge, arXiv preprint arXiv:2401.11048 (2024).
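As an illustration of the two topology-based scores described in the Data and Method section, the following is a minimal sketch of the resource allocation index and of Random Walk with Restart computed by power iteration. It is not the authors' implementation: the toy edge list, the function names and the restart value of 0.15 are assumptions made for this example, and the graph is treated as undirected.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build undirected adjacency sets from a list of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def resource_allocation(adj, u, v):
    """RA index: sum of 1/degree(z) over the common neighbours z of u and v."""
    common = adj.get(u, set()) & adj.get(v, set())
    return sum(1.0 / len(adj[z]) for z in common)


def random_walk_with_restart(adj, source, restart=0.15, tol=1e-12, max_iter=10_000):
    """Visiting probabilities of a walk that jumps back to `source` with
    probability `restart` at every step, found by power iteration.
    The restart value is an assumed default, not a setting from the paper."""
    prob = {n: 0.0 for n in adj}
    prob[source] = 1.0
    for _ in range(max_iter):
        nxt = {n: 0.0 for n in adj}
        for u, neighbours in adj.items():
            # Spread the non-restarting mass of u evenly over its neighbours.
            share = (1.0 - restart) * prob[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share
        nxt[source] += restart  # total probability mass is 1, so restart mass is `restart`
        delta = sum(abs(nxt[n] - prob[n]) for n in adj)
        prob = nxt
        if delta < tol:
            break
    return prob


# Hypothetical toy graph: node names are placeholders, not iBKH identifiers.
toy_edges = [
    ("drugA", "gene1"), ("drugB", "gene1"),
    ("drugA", "gene2"), ("drugB", "gene2"),
    ("gene2", "disease1"),
]
toy_adj = build_adjacency(toy_edges)
ra_score = resource_allocation(toy_adj, "drugA", "drugB")
rwr_scores = random_walk_with_restart(toy_adj, "drugA")
print(ra_score, rwr_scores["drugB"])
```

In this sketch the RWR relevance of a candidate pair is read off as the visiting probability of the target node when the walk starts from the source node; either score can then be used to rank the node pairs being evaluated.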
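The edge masking and negative sampling step described in the Experiment settings section could look roughly like the sketch below. The data layout (a dictionary keyed by a pair of node types), the function name `mask_edges` and the toy graph are assumptions for illustration; the sketch also assumes that each edge type has enough unconnected node pairs for the sampling loop to terminate.

```python
import random


def mask_edges(edges_by_type, nodes_by_type, frac=0.10, seed=0):
    """For every edge type, remove `frac` of the edges (positive samples) and draw
    the same number of unconnected node pairs (negative samples).

    edges_by_type: {(type_i, type_j): [(u, v), ...]}  (hypothetical layout)
    nodes_by_type: {type: [node, ...]}
    """
    rng = random.Random(seed)
    kept, positives, negatives = {}, {}, {}
    for etype, edges in edges_by_type.items():
        edges = list(edges)
        existing = {frozenset(e) for e in edges}
        k = max(1, round(frac * len(edges)))
        removed = rng.sample(edges, k)
        removed_set = {frozenset(e) for e in removed}
        kept[etype] = [e for e in edges if frozenset(e) not in removed_set]
        positives[etype] = removed
        type_i, type_j = etype
        sampled = set()
        # Rejection sampling: keep pairs that are not edges in the original graph.
        while len(sampled) < k:
            u = rng.choice(nodes_by_type[type_i])
            v = rng.choice(nodes_by_type[type_j])
            pair = frozenset((u, v))
            if u != v and pair not in existing:
                sampled.add(pair)
        negatives[etype] = [tuple(sorted(p)) for p in sampled]
    return kept, positives, negatives


# Hypothetical toy data: 10 drugs, 10 genes, 20 drug-gene edges.
toy_nodes = {
    "dr": [f"dr{i}" for i in range(10)],
    "g": [f"g{j}" for j in range(10)],
}
toy_edges = {("dr", "g"): [(f"dr{i}", f"g{j}") for i in range(5) for j in range(4)]}
kept, positives, negatives = mask_edges(toy_edges, toy_nodes)
print(len(kept[("dr", "g")]), positives[("dr", "g")], negatives[("dr", "g")])
```

The `kept` edges play the role of the masked graph on which each method is run, while `positives` and `negatives` together form the balanced set of node pairs to be classified.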
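Two further pieces of the evaluation described in the Experiment settings section, average pooling of node embeddings into an edge representation and the AUC metric, can be sketched as follows. The embedding values, scores and labels are made up for the example, and the AUC is computed with the pairwise (Mann-Whitney) definition rather than with any particular library; in the study the scores would come from the logistic regression classifier, which is not reproduced here.

```python
def average_pool(vec_u, vec_v):
    """Edge representation as the element-wise mean of the two node embeddings."""
    if len(vec_u) != len(vec_v):
        raise ValueError("embeddings must have the same dimension")
    return [(a + b) / 2.0 for a, b in zip(vec_u, vec_v)]


def auc(scores, labels):
    """Area under the ROC curve via its pairwise definition: the fraction of
    (positive, negative) pairs in which the positive scores higher; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical two-dimensional embeddings for one candidate drug-gene pair.
toy_embeddings = {"drugA": [1.0, 0.0], "gene1": [0.0, 1.0]}
edge_feature = average_pool(toy_embeddings["drugA"], toy_embeddings["gene1"])

# Hypothetical classifier outputs for two masked edges (label 1) and two negatives (label 0).
toy_scores = [0.9, 0.1, 0.8, 0.3]
toy_labels = [1, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
print(edge_feature, toy_auc)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a small illustration; a rank-based formulation would be the usual choice for a test set of realistic size.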