Initial achievements in relation extraction from RNA-focused scientific papers

Emanuele Cavalleri1, Mauricio Soto-Gomez1, Ali Pashaeibarough1, Dario Malchiodi1, Harry Caufield2, Justin Reese2, Christopher J. Mungall2, Peter N. Robinson4, Elena Casiraghi1,2,3, Giorgio Valentini1,3 and Marco Mesiti1,2,*

1 Department of Computer Science, Università di Milano, Via Celoria 18, 20133 Milano
2 Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
3 ELLIS, European Laboratory for Learning and Intelligent Systems, Milan Unit, Italy
4 Berlin Institute of Health - Charité, Universitätsmedizin, Berlin, 13353, Germany

Abstract
Relation extraction from the scientific literature to comply with a domain ontology is a well-known problem in natural language processing and is particularly critical in precision medicine. The advent of large language models (LLMs) has paved the way for the development of new effective approaches to this problem, but the extracted relations can be affected by issues such as hallucination, which must be minimized. In this paper, we present the initial design and preliminary experimental validation of SPIREX, an extension of the SPIRES-based system for the extraction of RDF triples from scientific literature involving RNA molecules. Our system exploits schema constraints in the formulations of LLM prompts along with our RNA-based KG, RNA-KG, for evaluating the plausibility of the extracted triples. RNA-KG contains more than 9M edges representing different kinds of relationships in which RNA molecules can be involved. Initial experimental results on a controlled data set are quite encouraging.

Keywords
RNA-based Knowledge Graphs, relation discovery, LLM, Prompt Engineering, Link Prediction

1. Introduction
Ribonucleic acid (RNA) plays a critical role in the central dogma of molecular biology, serving as the intermediary between DNA and proteins, the building blocks of life.
Beyond its traditional role in protein synthesis, RNA is involved in a variety of cellular processes, including gene regulation and catalysis, highlighting its importance in understanding the complexities of biological systems. RNA-KG [1] is the first ontology-based knowledge graph for representing coding and non-coding RNA molecules and their interactions with other biomolecular data as well as with pathways, abnormal phenotypes and diseases to support the study and the discovery of the biological role of RNA. RNA-KG contains around 9M edges extracted from more than 50 public data sources and can be exploited to study RNA molecules and develop innovative graph algorithms to support knowledge discovery in data science.

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
* Corresponding author.
Email: emanuele.cavalleri@unimi.it (E. Cavalleri); mauricio.soto@unimi.it (M. Soto-Gomez); ali.pashaeibarough@unimi.it (A. Pashaeibarough); dario.malchiodi@unimi.it (D. Malchiodi); jhc@lbl.gov (H. Caufield); justaddcoffee@gmail.com (J. Reese); CJMungall@lbl.gov (C. J. Mungall); peter.robinson@bih-charite.de (P. N. Robinson); elena.casiraghi@unimi.it (E. Casiraghi); giorgio.valentini@unimi.it (G. Valentini); marco.mesiti@unimi.it (M. Mesiti)
ORCID: 0000-0003-1973-5712 (E. Cavalleri); 0000-0001-5977-9467 (M. Soto-Gomez); 0000-0002-0559-1992 (A. Pashaeibarough); 0000-0002-7574-697X (D. Malchiodi); 0000-0001-5705-7831 (H. Caufield); 0000-0002-2170-2250 (J. Reese); 0000-0002-6601-2165 (C. J. Mungall); 0000-0002-0736-9199 (P. N. Robinson); 0000-0003-2024-7572 (E. Casiraghi); 0000-0002-5694-3919 (G. Valentini); 0000-0001-5701-0080 (M. Mesiti)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.
The manual ingestion of triples into a knowledge graph by expert curators is time-consuming and costly, and tools that support curators in extracting biological entities and their relationships from plain text are in high demand. The advent of LLMs [2] has paved the way for the development of new effective tools for this problem [3]. However, these techniques have shown several limitations, such as generating incorrect statements due to hallucinations (inaccurate, nonsensical, or irrelevant output in the given context) [4] and insensitivity to negations [5], that cannot be tolerated in sensitive domains like precision medicine.

SPIRES (Structured Prompt Interrogation and Recursive Extraction of Semantics) [6] is a recently proposed knowledge extraction approach that exploits LLMs to identify instances of a knowledge schema expressed in LinkML [7]. Since the schema conceptualizes a given domain in terms of the concepts, relationships, and properties of interest, it can be used to define more effective LLM prompts. Additionally, SPIRES allows grounding of atomic textual elements as concepts taken from a variety of OBO Foundry ontologies [8]. Even though SPIRES has proven its efficiency in extracting triples from plain text according to bio-ontologies, the reliability of the extracted triples still needs to be evaluated, both in terms of the generated identifiers (i.e., whether they correctly represent the identified entities) and the accuracy of their data source. In this paper, we address this problem by exploiting RNA-KG as a gold standard in the RNA world, because it contains many interactions involving RNA molecules and can be used to evaluate the plausibility of the extracted triples. To leverage the ability of SPIRES to extract triples from texts and to support experts in their validation, we present SPIREX, a system for the extraction of reliable triples from scientific papers.
The system has two main backbones. On one side, SPIRES and the LinkML representation of the RNA-KG schema [9] allow the extraction of RDF triples compliant with the domain ontology. On the other side, we use RNA-KG as a gold standard providing knowledge about interactions involving RNA molecules and use link prediction techniques to assess the 'plausibility' of the extracted triples, i.e., the likelihood that a triple is part of RNA-KG. The initial experimental results on a manually curated testbed of 60 scientific texts are encouraging.

2. RNA-KG and SPIRES
RNA-KG [1] is the first knowledge graph encompassing biological knowledge about RNAs gathered from more than 50 public databases, integrating functional relationships with genes, proteins, chemicals, and ontologically grounded biomedical concepts. The current release of RNA-KG consists of a single component containing around 600K nodes and 9M edges and can be queried via a SPARQL endpoint at https://RNA-KG.anacleto.di.unimi.it. Nodes are usually mapped to reference biomedical vocabularies and ontologies such as NCBI Gene Entrez identifiers for uniquely identifying genes and many kinds of non-coding RNAs (ncRNAs), Human Phenotype Ontology (HPO [10]) for phenotypes, Monarch merged disease ontology (Mondo [11]) for diseases, and Gene Ontology (GO [12]) for annotating ncRNAs. Moreover, all the possible interactions are represented by means of the Relation Ontology (RO [13]). This ensures common semantics for the different relationships that can be extracted from the sources.

Figure 1: (a) node distribution according to node types; (b) edge distribution according to edge types.

Figure 1a shows the distribution of nodes contained in RNA-KG (details in [1]). Nodes can be classified into nodes representing ontological terms and bio-entities lacking a direct mapping to ontological terms. Bio-entities have been further subdivided into RNA nodes and non-RNA nodes (named other bio-entities) that contain, for instance, gene nodes and nodes describing genomic features (e.g., nucleotide substitution). Figure 1b shows the distribution of edges in RNA-KG. Edges have been subdivided into three categories: (i) edges representing RO properties that characterize interactions among RNA molecules in the considered sources; (ii) other edges not belonging to RO properties; (iii) edges representing the subClassOf relationships. The edges of the last two categories result from the integration of bio-ontologies into RNA-KG and from the lack of a dedicated ontology for RNA molecules. When RNA molecules cannot be precisely mapped to a reference ontology, they are included as subClassOf an appropriate class within Sequence Ontology (SO [14]).

SPIRES [6] is a recently proposed approach to information extraction that creates and refines prompts to maximize the effectiveness of LLMs by exploiting domain knowledge encapsulated through a schema expressed in LinkML [7]. Given an input text, it adopts zero-shot or few-shot learning to identify and extract relevant entities and the relationships among them, which are then normalized and grounded through ontologies and vocabularies.
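The normalization and grounding step can be pictured with a minimal sketch, which is not the SPIRES implementation. It assumes a hypothetical look-up table (`LOOKUP`) from text mentions to identifiers and a hypothetical constraint table (`ALLOWED`) stating which entity types a predicate may connect. The predicate names follow the RO labels used in RNA-KG, but the type pairs attached to them, the mentions, and the `EX:` identifiers are placeholders chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical look-up table: normalized mention -> (identifier, entity type).
# The "EX:" identifiers are placeholders, not real ontology identifiers.
LOOKUP = {
    "mirna-x": ("EX:0001", "miRNA"),
    "gene-y": ("EX:0002", "gene"),
    "disease-z": ("EX:0003", "disease"),
}

# Assumed schema constraints: predicate -> allowed (subject type, object type).
ALLOWED = {
    "regulates activity of": {("miRNA", "gene")},
    "causes or contributes to condition": {("miRNA", "disease"), ("gene", "disease")},
}


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    object: str


def ground(mention):
    """Map a text mention to (identifier, type), or None if it is unknown."""
    return LOOKUP.get(mention.strip().lower())


def ground_and_check(subject, predicate, obj):
    """Return a grounded Triple, or None if grounding or the schema check fails."""
    s, o = ground(subject), ground(obj)
    if s is None or o is None:
        return None  # at least one mention could not be grounded
    if (s[1], o[1]) not in ALLOWED.get(predicate, set()):
        return None  # the predicate does not admit this pair of entity types
    return Triple(s[0], predicate, o[0])
```

A triple whose endpoints cannot be grounded, or whose types violate the constraint table, is rejected rather than repaired; a real pipeline might instead flag it for curator review.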
SPIRES is a general-purpose approach that can be used across a variety of domains and does not require specific training/tuning on the considered domain. SPIRES adopts an engineering approach for creating prompts for interacting with an LLM (like GPT3, GPT4) to improve the quality of the generated responses through the use of a domain-specific schema. In this way, technical challenges for generative AI (e.g., constructing comprehensive real-world knowledge and improving the accuracy of automated responses) can be addressed. The specification of this schema in LinkML contains the classes of entities and the relationships among them within the specified domain. Classes can also include attributes (e.g., name, type, and list of synonyms) to enrich entity description. The LinkML schema is automatically processed to generate a list of prompts through which SPIRES interacts with an LLM. Each prompt of the list is submitted to the LLM to collect information that is then used to complete the following prompt, possibly consulting the bio-ontologies (e.g., for replacing a protein symbol with the corresponding identifier in an ontology). This recursive refinement process improves the quality of the information gathered through the LLM.

Figure 2: An excerpt of RNA-KG schema.

3. The SPIREX system
As shown in the architecture in Figure 3, SPIREX is composed of two modules. The SPIRES module extracts the RDF triples from scientific abstracts. Then, an embedding of RNA-KG is used to validate the generated triples and score their level of plausibility.

SPIRES module for RNA-KG. Through the study of the scientific literature about RNA interactions and the analysis of more than 50 data sources [1] all over the world, we have identified the kinds of relationships that can involve RNA molecules and reported them in a meta-graph [15]. Figure 2 shows an excerpt of the UML schema describing the entities that are connected to miRNA molecules through different kinds of relationships in the meta-graph. Starting from its LinkML representation, a list of prompts specific to the RNA domain is generated; with these prompts, the entities and relationships contained in a text are extracted in accordance with the schema constraints. Moreover, SPIRES adopts the bio-ontologies of our domain (details in [9]) for producing source and target identifiers according to the RNA-KG identification scheme and RO predicates.

RNA-KG module for link prediction. The validation of new potential relations derived from the SPIRES module can be modeled as a link prediction task on RNA-KG, performed via either Graph Neural Networks (GNNs) or Random-Walk (RW) based methods for Graph Representation Learning. GNN approaches usually present scalability issues, while RW-based graph embedding overcomes this problem by sampling the graph with random walks to construct a representation of the nodes (and edges) in a lower-dimensional vector space that feeds traditional ML models.

Figure 3: The SPIREX architecture.

In SPIREX we have used Node2vec [16] for the embedding of RNA-KG. Node2vec is a well-known random walk-based approach that aims to capture the graph topology from the node neighborhoods.
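The walks used by Node2vec [16] are second-order: the next step depends on both the current node and the previous one, through a return parameter p and an in-out parameter q. The following is a minimal pure-Python sketch of one such walk on an unweighted, undirected toy graph. It is only an illustration, not the GRAPE implementation used in SPIREX; the function name and the adjacency-dictionary format are assumptions made for the example.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=None):
    """Sample one second-order biased random walk of up to `length` nodes.

    adj maps each node to the set of its neighbours (undirected, unweighted).
    p is the return parameter and q the in-out parameter, as in node2vec.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:
            break  # dead end: isolated node
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)  # step back to the previous node
            elif x in adj[prev]:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move away from the previous node
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


# Toy graph with hypothetical node names.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
example_walk = node2vec_walk(toy, "a", 6, p=0.5, q=2.0, rng=random.Random(0))
```

With q greater than 1 the walk tends to stay near its starting neighbourhood, and with q smaller than 1 it tends to explore further away. The sampled walks would then be fed to a skip-gram-style model to obtain the node vectors.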
The model generates a set of second-order random walks across the graph, which are used to train a shallow neural network to compute a vector representation of the graph components. One of the key features of Node2vec is the possibility to generate paths that focus either on the local or on the global structure of the graph, providing great flexibility in the graph representation. Our system uses the implementation available in GRAPE [17], a software resource specifically designed for the manipulation and embedding of large graphs.

4. Preliminary experimental results
Experiments have been carried out on both modules of SPIREX. For the first module, we evaluated the prediction accuracy of SPIRES in extracting triples from a set of manually annotated documents. We also compared SPIRES with base LLMs to verify the advantage of using LinkML in the specification of the domain schema. For the second module, we checked whether a simple predictive model can generate reasonable scores on RNA-KG. Finally, we assessed the ability of the predictor to evaluate the plausibility of triples extracted through SPIRES according to RNA-KG.

SPIRES prediction accuracy and comparison with base LLMs. As described in [9], a corpus of 60 scientific articles related to RNA molecules and their interactions has been gathered from PubMed, ResearchGate, and Google Scholar. Starting from them, we have identified abstracts, discussions, or specific subsections within the domain of interest. They have been manually annotated with the entities and the six kinds of interactions that can be extracted from them (reported on the y-axis of the diagram in Figure 4a). For evaluating the predictions, we have used standard metrics (precision, recall, and F-score), counting the True Positives (TP), False Positives (FP), and False Negatives (FN) against the manually tagged paragraphs.
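As a minimal sketch of how these counts and metrics can be derived, the function below compares a set of extracted triples with a set of manually annotated ones. It assumes exact matching of (subject, predicate, object) tuples and the balanced F1 form of the F-score. The triples in the example are hypothetical placeholders.

```python
def score_extraction(gold, predicted):
    """Compute TP/FP/FN and precision, recall, F1 for extracted triples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # extracted and annotated
    fp = len(predicted - gold)   # extracted but not annotated
    fn = len(gold - predicted)   # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical annotated and extracted triples.
gold = {
    ("mirna-1", "regulates activity of", "gene-1"),
    ("mirna-1", "causes or contributes to condition", "disease-1"),
    ("gene-1", "has gene product", "protein-1"),
    ("mirna-2", "interacts with", "mirna-3"),
}
predicted = {
    ("mirna-1", "regulates activity of", "gene-1"),
    ("mirna-1", "causes or contributes to condition", "disease-1"),
    ("gene-1", "has gene product", "protein-1"),
    ("mirna-2", "regulates activity of", "gene-2"),
}
result = score_extraction(gold, predicted)
```

In this toy case three triples match, one extracted triple is spurious, and one annotated triple is missed, so precision, recall, and F1 all equal 0.75.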
As shown in Figure 4a, the results obtained using GPT3.5-turbo in SPIRES indicate, for each category of interaction, a consistent trend where the TP rate tends to be higher than both the FP and FN rates. The only exception is protein-disease relations, where the FN rate is higher than the TP rate. We noticed that many protein-disease relations go undetected, often because they are expressed in complex ways, which can lead to inaccurate entity recognition. Despite this, the overall precision remains remarkably high and, in biomedicine, this is preferable because it prioritizes certainty over ambiguity.

Figure 4: Evaluation of SPIRES on relation extraction involving protein, miRNA, disease, and gene entities and comparison against different LLMs. (a) TP, FP, and FN results on 60 texts (x-axis: rate; y-axis: disease-disease, miRNA-disease, miRNA-miRNA, gene-protein, miRNA-gene, protein-disease). (b) Comparing SPIRES, Llama, GPT on 20 texts.

Totals reported in Figure 4a:

          TP    FP    FN    F-score  Precision  Recall
Total     304   41    134   0.76     0.88       0.69

Scores reported in Figure 4b:

                  F-score  Precision  Recall
SPIRES-GPT4t      0.88     0.91       0.86
SPIRES-GPT3.5t    0.86     0.94       0.81
Llama 2           0.74     0.89       0.64
GPT3.5t           0.64     0.73       0.57

We have assessed the performance of SPIRES by considering as baseline approaches OpenAI GPT (ver. GPT3.5-turbo) and Llama 2 [18] (ver. llama-2-70b-chat). As back-end LLM of SPIRES, we have considered both GPT3.5-turbo and GPT4-turbo. We have manually grounded instances and relationships that can be extracted from 20 documents among those considered in the previous experiment. Regarding the prompt to be used with the base LLM system, we have considered a simple one requesting to extract triples from the considered text with an explicit request for mapping the extracted concepts to appropriate terminologies.
Given that both OpenAI GPT and Llama 2 caution that the ontology identifiers provided are hypothetical and might not align with actual identifiers in the ontologies, and considering the general community advice against relying on IDs from an LLM [19], we decided to substitute the grounding process with our manually curated look-up tables [1]. As shown in Figure 4b, SPIRES outperforms baseline LLMs used alone both in terms of precision and recall. The histogram points out a marked increase in TP rate and a decrease in FP and FN rates when adopting SPIRES for extracting relations that adhere to a specified schema within texts. Furthermore, when adopting GPT4-turbo in SPIRES, the recall improves due to the lower FN rate, with a positive effect on the F-score.

Figure 5: Distribution of the miRNA-disease edge predictions of the RNA-KG link prediction module (x-axis: prediction probabilities; y-axis: number of triples). Node2vec predictions: (a) on the test set of RNA-KGΔ𝑑𝑖𝑠𝑒𝑎𝑠𝑒; (b) on the test set of RNA-KGΔ10%; (c) of the edges predicted by SPIRES with RNA-KGΔ𝑑𝑖𝑠𝑒𝑎𝑠𝑒; (d) of the edges predicted by SPIRES with RNA-KGΔ10%. In (c)-(d), 'blue'/'orange' bars represent triples 'included'/'not included' in the view.

Evaluation of the plausibility of SPIREX predictions. For evaluating the plausibility, we have considered a restricted view of RNA-KG that roughly corresponds to the schema in Figure 2, focusing on the prediction of miRNA-disease relationships. More precisely, we have considered two different settings of the hold-out procedure to evaluate prediction performance.
In the first one, named RNA-KGΔ𝑑𝑖𝑠𝑒𝑎𝑠𝑒, the test set corresponds to triples involving miRNAs and diseases from the source RNADisease [20], while the training set corresponds to the remaining miRNA-disease triples of RNA-KG. In the second one, named RNA-KGΔ10%, we randomly included in the test set 10% of the miRNA-disease triples and in the training set the remaining 90%, independently of their original source, while guaranteeing that graph connectivity is maintained, according to a connected Monte-Carlo hold-out strategy [17]. We directly applied node2vec to the prediction of miRNA-disease edges according to the RNA-KGΔ𝑑𝑖𝑠𝑒𝑎𝑠𝑒 and RNA-KGΔ10% experimental settings, using a Multi-Layer Perceptron trained on the node2vec edge embeddings. The default parameters adopted in GRAPE have been chosen.

Figures 5a and 5b show that the triples in the test set exhibit high probabilities in both experimental settings. As reported in Figure 5a, with RNA-KGΔ𝑑𝑖𝑠𝑒𝑎𝑠𝑒, ∼63% of these triples are associated with a score higher than 0.6. In the case of RNA-KGΔ10%, we notice that ∼88% of the test set was correctly classified with a probability higher than 0.6 and ∼74% with a probability higher than 0.8 (see Figure 5b). Node2vec is thus a reasonable predictor of miRNA-disease edges to be included in RNA-KG and can be used to assess the plausibility of SPIRES predictions.

To assess the ability of node2vec to evaluate the plausibility of the triples extracted by SPIRES, we have considered true positive triples extracted from our manually curated dataset involving miRNAs and diseases. Figures 5c and 5d show the distribution of the probabilities predicted by node2vec on the miRNA-disease edges extracted by SPIRES. Specifically, blue columns represent the number of miRNA-disease triples that are already included in RNA-KG, whereas orange columns represent the number of triples that are missing in RNA-KG.
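One way to read such a histogram is as a simple triage of the extracted triples. The sketch below is an illustration under stated assumptions, not part of SPIREX: `score` stands for any link predictor returning a probability, the 0.6 cut-off is borrowed from the reporting thresholds above and used here as a hypothetical decision threshold, and the bucket names and example triples are placeholders.

```python
def triage(triples, kg_edges, score, threshold=0.6):
    """Group extracted triples by KG membership and predicted probability.

    triples:   iterable of (source, target) pairs extracted from text
    kg_edges:  set of (source, target) pairs already in the knowledge graph
    score:     callable returning a probability in [0, 1] for a pair
    threshold: assumed cut-off separating plausible from uncertain triples
    """
    buckets = {"known": [], "candidate": [], "uncertain": []}
    for t in triples:
        prob = score(t)
        if t in kg_edges:
            buckets["known"].append((t, prob))      # already in the KG
        elif prob >= threshold:
            buckets["candidate"].append((t, prob))  # new and plausible
        else:
            buckets["uncertain"].append((t, prob))  # new, low probability
    return buckets


# Hypothetical miRNA-disease pairs and scores.
kg = {("mirna-1", "disease-1")}
extracted = [("mirna-1", "disease-1"),
             ("mirna-2", "disease-1"),
             ("mirna-3", "disease-2")]
scores = {("mirna-1", "disease-1"): 0.95,
          ("mirna-2", "disease-1"): 0.90,
          ("mirna-3", "disease-2"): 0.20}
buckets = triage(extracted, kg, scores.get)
```

Keeping the probability next to each triple lets a curator sort the new triples by score instead of relying on a single fixed cut-off.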
In both cases, node2vec is able to correctly classify almost all the triples already present in the partial KG, but it can also discriminate between plausible and implausible new triples, offering a potential validation tool. Indeed, in both experimental settings, we can identify a set of edges included in RNA-KG that are predicted with a high probability by both SPIRES and node2vec (blue bars), but also a set of edges extracted by SPIRES and predicted with a high probability by node2vec even though these edges are not present in RNA-KG (orange bars). These last edges can be considered as possible new candidates for miRNA-disease relationships. The orange bars at low probabilities in Figures 5c and 5d denote relationships identified by SPIRES yet assigned a low probability by node2vec. These edges can be considered "uncertain" in the sense that they are not confirmed by an independent edge prediction method that exploits the topological characteristics of RNA-KG. We believe these results can be improved by considering expanded views of RNA-KG and more complex ML methods capable of accommodating its inherent heterogeneity.

5. Concluding remarks
In this paper we have described the initial steps in the design and development of the SPIREX system for the extraction of meaningful triples from scientific papers, which exploits RNA-KG as a gold standard for checking the plausibility of the extracted triples. The initial experimental results are encouraging regarding the effectiveness of the proposed tool. At the current stage, we have used a basic link prediction measure for assessing a relationship's plausibility according to the knowledge graph's current state.
However, a much more accurate measure should be developed that takes into account other factors (like the number of times the relationship has been identified in different sources, the presence of the relationship in other sources of information, or the coherence of the relationship with respect to the other triples extracted from the same scientific paper). We are also considering the adoption of other link prediction methodologies, especially those for heterogeneous graphs that can easily scale with big KGs. Finally, even if the approach has been tested in the context of RNA-KG, we would like to generalize it to other application domains that exploit biomedical KGs (e.g. [21]) for extracting new facts from texts.

Datasets. Experiments have been carried out using the following datasets: (schema and docs) https://doi.org/10.5281/zenodo.10671796; (RNA-KG) https://doi.org/10.5281/zenodo.10078876.

Acknowledgements. This research was in part supported by the "National Center for Gene Therapy and Drugs based on RNA Technology", PNRR-NextGeneration EU program [G43C22001320007] and in part by the MUSA - Multilayered Urban Sustainability Action - Project, funded by the PNRR-NextGeneration EU program ([G43C22001370007], Code ECS00000037).

References
[1] E. Cavalleri, et al., RNA-KG: An ontology-based knowledge graph for representing interactions involving RNA molecules, 2023. arXiv:2312.00183.
[2] R. Bommasani, et al., On the opportunities and risks of foundation models, 2021. arXiv:2108.07258.
[3] A. J. Thirunavukarasu, et al., Large language models in medicine, Nature Medicine 29 (2023) 1930–1940. doi:10.1038/s41591-023-02448-8.
[4] Z. Ji, et al., Survey of hallucination in natural language generation, ACM Computing Surveys 55 (2023) 1–38. doi:10.1145/3571730.
[5] A. Ettinger, What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models, Transactions of the Association for Computational Linguistics 8 (2020) 34–48. doi:10.1162/tacl_a_00298.
[6] J. H. Caufield, et al., Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning, Bioinformatics 40 (2024) btae104. doi:10.1093/bioinformatics/btae104.
[7] S. Moxon, et al., The Linked Data Modeling Language (LinkML): A General-Purpose Data Modeling Framework Grounded in Machine-Readable Semantics, in: International Conference on Biomedical Ontologies, 2021, pp. 148–151. URL: https://ceur-ws.org/Vol-3073/paper24.pdf.
[8] R. Jackson, et al., OBO Foundry in 2021: operationalizing open data principles to evaluate ontologies, Database 2021 (2021). doi:10.1093/database/baab069.
[9] E. Cavalleri, M. Mesiti, On the extraction of meaningful RNA interactions from scientific publications through LLMs and SPIRES, in: 8th Int'l workshop on Data Analytics solutions for Real-LIfe APplications, 2024.
[10] P. N. Robinson, et al., The human phenotype ontology: A tool for annotating and analyzing human hereditary disease, The American Journal of Human Genetics 83 (2008) 610–615. doi:10.1016/j.ajhg.2008.09.017.
[11] N. A. Vasilevsky, et al., Mondo: Unifying diseases for the world, by the world, 2022. doi:10.1101/2022.04.13.22273750.
[12] M. Ashburner, et al., Gene ontology: tool for the unification of biology, Nature Genetics 25 (2000) 25–29. doi:10.1038/75556.
[13] C. Mungall, et al., oborel/obo-relations: 2023-08-18 release, 2023. doi:10.5281/zenodo.8263469.
[14] K. Eilbeck, et al., The sequence ontology: a tool for the unification of genome annotations, Genome Biology 6 (2005). doi:10.1186/gb-2005-6-5-r44.
[15] E. Cavalleri, et al., A meta-graph for the construction of an rna-centered knowledge graph, in: Bioinformatics and Biomedical Engineering, Springer Nature Switzerland, Cham, 2023, pp. 165–180. doi:10.1007/978-3-031-34953-9_13.
[16] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, 2016. arXiv:1607.00653.
[17] L. Cappelletti, et al., GRAPE for fast and scalable graph processing and random-walk-based embedding, Nature Computational Science 3 (2023) 552–568. doi:10.1038/s43588-023-00465-8.
[18] H. Touvron, et al., Llama 2: Open foundation and fine-tuned chat models, 2023.
[19] T. Groza, H. Caufield, D. Gration, G. Baynam, M. A. Haendel, P. N. Robinson, C. J. Mungall, J. T. Reese, An evaluation of GPT models for phenotype concept recognition, 2023. arXiv:2309.17169.
[20] J. Chen, et al., RNADisease v4.0: an updated resource of RNA-associated diseases, providing RNA-disease analysis, enrichment and prediction, Nucleic Acids Research 51 (2022) D1397–D1404. doi:10.1093/nar/gkac814.
[21] T.J. Callahan, et al., An Open-Source Knowledge Graph Ecosystem for the Life Sciences, Scientific Data 11(1) (2024) 363. doi:s41597-024-03171-w.