
ISSN 1613-0073

On the extraction of meaningful RNA interactions from Scientific Publications through LLMs and SPIRES

Emanuele Cavalleri (emanuele.cavalleri@unimi.it), Marco Mesiti (marco.mesiti@unimi.it)

AnacletoLab - Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, Milano

Keywords: RNA-based technologies, Knowledge Graphs, RNA-drug discovery, Large Language Models

Knowledge graphs (KGs) are useful tools to uniformly represent and integrate heterogeneous information about a domain of interest. However, they are inherently incomplete; therefore, new facts should be introduced by extracting them from structured and unstructured data sources. Starting from RNA-KG, the first KG tailored for representing different kinds of RNA molecules that we recently developed, in this paper we evaluate the use of SPIRES for extracting interactions among bio-entities involving RNA molecules from scientific papers guided by the RNA-KG schema. SPIRES is a general-purpose knowledge extraction system for mining information conforming to a specified schema. A customized prompt is generated and submitted to a Large Language Model (LLM) along with a text to extract a set of RDF triples adhering to the schema constraints. The experiments show a high accuracy in extracting interactions from the scientific literature.

1. Introduction

The “RNA world” represents a novel frontier for the study of fundamental biological processes and human diseases and is paving the way for the development of new drugs. Although scientific data about coding and non-coding RNA molecules are continuously produced and made available from public repositories, they are scattered across different databases and in the scientific literature. A centralized, uniform, and semantically consistent representation of the knowledge on RNA is still lacking. We have recently constructed RNA-KG [1], a knowledge graph integrating biological knowledge about RNA molecules with their functional relationships with genes, proteins, chemicals, and biomedical ontological concepts. RNA-KG includes around 600K nodes and 9M RDF triples representing reliable interactions involving RNA molecules and related biomedical concepts extracted from more than 50 public data sources according to 11 bio-ontologies.

RNA-KG is coupled with a meta-graph representing all the possible interactions involving RNA molecules.

SPIRES (Structured Prompt Interrogation and Recursive Extraction of Semantics) [2] is a recently proposed approach to information extraction that exploits Large Language Models (LLMs) [3] to identify instances of a knowledge schema expressed in terms of LinkML [4] starting from plain texts. It adopts zero-shot learning to identify and extract relevant entities and relationships among them from an input text, which are then normalized and grounded through ontologies and vocabularies.

Published in the Proceedings of the Workshops of the EDBT/ICDT 2024, CEUR Workshop Proceedings (CEUR-WS.org).

SPIRES is a general-purpose approach that can be used without specific training/tuning on the considered domain. SPIRES adopts an engineering approach for creating prompts for interacting with an LLM (like GPT [5], Llama 2 [6], Mistral [7], and Zephyr [8]) to improve the quality of the generated responses [9]. In this way, technical challenges for generative AI (e.g., constructing comprehensive real-world knowledge and improving the accuracy of automated responses) can be addressed.

In this paper, we discuss the initial experimental results that we obtained by applying SPIRES to the extraction of interactions among bio-entities involving RNA molecules in the context of the PNRR project “Gene Therapy and Drugs based on RNA Technology”. The purpose of the experiments is to show the level of accuracy of the system in extracting interactions from the scientific literature and investigate the possibility of combining RNA-KG with [...] RNA molecules is particularly challenging for two reasons. First, a well-recognized ontology for characterizing non-coding RNA molecules is still lacking; second, different identifiers are adopted for representing the same bio-entity. Even if a more systematic evaluation should be conducted, the initial results are very encouraging.

The paper is structured as follows. Section 2 describes the SPIRES approach and related approaches that integrate LLMs with knowledge data. Section 3 presents the LinkML schema that we have developed for interacting with SPIRES. Section 4 describes the experimental results, while Section 5 reports concluding remarks.

[Figure 1: The SPIRES workflow. The schema panel shows a LinkML class Protein with an attribute label (description: the name of the protein) and the annotator sqlite:obo:pr.]
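To make the role of the Protein class in Figure 1 more concrete, the following Python sketch mirrors that class fragment as a plain dictionary and derives a fill-in prompt from it. It is only an illustration written for this text, not the SPIRES implementation: the dictionary layout, the prompt wording, and the names PROTEIN_CLASS, build_prompt, and parse_response are assumptions, and the input sentence is a made-up placeholder.

```python
# Hypothetical, simplified mirror of the Protein class fragment in Figure 1.
PROTEIN_CLASS = {
    "name": "Protein",
    "attributes": {
        "label": {
            "description": "the name of the protein",
            "annotators": ["sqlite:obo:pr"],
        },
    },
}


def build_prompt(schema_class: dict, text: str) -> str:
    """Render a fill-in template with one line per attribute of the class.

    The wording of the instruction is an assumption made for this sketch.
    """
    lines = [
        "From the text below, extract the following entities "
        "in the following format:",
        "",
    ]
    for attr_name, attr in schema_class["attributes"].items():
        lines.append(f"{attr_name}: <{attr['description']}>")
    lines += ["", "Text:", text]
    return "\n".join(lines)


def parse_response(schema_class: dict, response: str) -> dict:
    """Keep only 'key: value' lines whose key is an attribute of the class."""
    allowed = set(schema_class["attributes"])
    result = {}
    for line in response.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in allowed:
            result[key.strip()] = value.strip()
    return result


# Placeholder input sentence and a hand-written stand-in for an LLM reply.
prompt = build_prompt(PROTEIN_CLASS, "COX20 was mentioned in the paragraph.")
parsed = parse_response(PROTEIN_CLASS, "label: COX20\nnote: not in the schema")
print(prompt)
print(parsed)
```

The parsed value is still a plain label at this point; replacing it with an ontology identifier is the grounding step that SPIRES delegates to the annotators declared in the schema.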

2. SPIRES and Related Work

The population of a KG by extracting triples from unstructured texts is an interesting research activity, and the advent of LLMs has boosted the interpretation of highly technical languages, as shown on question-answering benchmarks [10]. However, these techniques have shown different limitations, such as generating incorrect statements due to hallucinations [11] and insensitivity to negations [12], that cannot be tolerated in sensitive domains like precision medicine. SPIRES adopts: i) the knowledge schema of a specific domain for the generation of prompts for reducing these drawbacks; and ii) bio-ontologies for enhancing the quality of the produced information.

Figure 1 outlines the SPIRES workflow. SPIRES requires the specification of the knowledge schema expressed in LinkML [4] to guide the system in the extraction of knowledge. A LinkML schema contains the classes of entities and relationships among them within the specified domain. Classes can also include attributes (e.g., name, type, and list of synonyms) to enrich the entity description. The LinkML schema is automatically processed to generate a list of prompts through which SPIRES interacts with an LLM (e.g., GPT3, GPT4, Llama 2, Mistral, and Zephyr). Each prompt of the list is submitted to the LLM for collecting information that is exploited for completing the following prompt, possibly taking the bio-ontologies into account (e.g., for replacing a protein symbol with the corresponding identifier in an ontology). This recursive refinement process improves the quality of the information gathered through the LLM.

Example 1. Suppose we wish to extract proteins from a text. A LinkML expression can be generated for describing the class Protein with its properties and the adopted identification scheme (see Figure 1). A prompt is then generated for this class and used for extracting proteins. However, the result obtained by ChatGPT alone (in this case COX20) is not compliant with the Protein class structure. Therefore, SPIRES exploits bio-ontologies (e.g., the PRotein Ontology, PRO [13]) to obtain an adequate result.

Furthermore, in case relationships are identified, SPIRES selectively retains only those aligned with the predefined schema that can be grounded to the Relations Ontology (RO [14]). By exploiting standard identification schemes adopted by the reference bio-ontologies, the system guarantees the generation of triples that can be easily integrated into a biomedical KG.

SPIRES thus creates and refines prompts to maximize the effectiveness of LLMs by exploiting domain knowledge encapsulated through the description of the classes and relationships that we wish to include in the KG.

As outlined in [9], the explicit and structured information contained in KGs can also be used for improving the knowledge awareness of LLMs. KGs have been used: i) in the training of the LLM [15, 16]; ii) during the inference stage for making available to the LLMs the latest knowledge without retraining [17]; iii) to improve the interpretability of LLMs by explaining the facts [18] and by enhancing the reasoning process of LLMs [19]. One of the main disadvantages of solution i) is that the enhancement of the knowledge contained in the KG requires a retraining of the model, which is a time (and money) consuming activity. For this reason, approaches of solution ii) are gaining momentum because they allow the separation of the text space and the knowledge space. In this case, knowledge is injected at the time of inference.

3. The SPIRES Schema for RNA-KG

For the creation of the schema needed for the application of SPIRES, we considered the RNA-KG meta-graph [20] that represents all the kinds of relationships involving RNA molecules in the considered data sources. Starting from it, a UML class diagram was developed that formally describes the schema of the considered domain and can be used for identifying meaningful relationships in the considered domain. Figure 2 shows an excerpt of the generated UML class diagram that consists of four biological and biomedical classes (miRNA, gene, protein, and disease) with six kinds of RO relationships.

[Figure 2: Excerpt of the UML class diagram derived from the RNA-KG meta-graph. Classes and attributes: miRNA (id: String, description: String, sequence: String, family name: String); gene (id: Integer, type: String, symbol: String); disease (id: String, description: String, synonym: String list); protein (id: String, description: String, synonym: String list, ortholog: String list, sequence: String). Relationships: regulates activity of, interacts with, gene product of, has gene product, causally related to, causes or contributes to condition.]
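One way to read the excerpt in Figure 2 is as a set of admissible (subject class, relation, object class) combinations. The following Python sketch, written for this text and not part of SPIRES or RNA-KG, encodes such a set and discards candidate triples that do not match it. Class names and relation labels follow the diagram and the description given in this section; the pairing of causally related to with disease-disease interactions is an assumption inferred from the interaction kinds evaluated in Section 4. The names Triple, ALLOWED, conforms, and filter_triples are hypothetical, and the instances are arbitrary labels borrowed from examples in the text, not biological claims.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    subject_type: str
    predicate: str
    obj: str
    object_type: str


# Admissible (subject class, relation, object class) combinations.
ALLOWED = frozenset({
    ("miRNA", "interacts with", "gene"),
    ("miRNA", "regulates activity of", "gene"),
    ("miRNA", "regulates activity of", "miRNA"),
    ("gene", "has gene product", "protein"),
    ("protein", "gene product of", "gene"),
    ("protein", "regulates activity of", "gene"),
    ("protein", "causes or contributes to condition", "disease"),
    ("miRNA", "causes or contributes to condition", "disease"),
    # Assumption: inferred from the disease-disease interactions of Section 4.
    ("disease", "causally related to", "disease"),
})


def conforms(triple: Triple) -> bool:
    """Return True if the triple matches one admissible combination."""
    key = (triple.subject_type, triple.predicate, triple.object_type)
    return key in ALLOWED


def filter_triples(triples: list[Triple]) -> list[Triple]:
    """Keep only the triples aligned with the schema excerpt."""
    return [t for t in triples if conforms(t)]


# Illustrative candidates: the second one uses a gene as the subject of a
# relation that the excerpt only admits for proteins and miRNAs.
candidates = [
    Triple("hsa-let-7b-5p", "miRNA", "regulates activity of", "TP53", "gene"),
    Triple("TP53", "gene", "causes or contributes to condition",
           "osteoporosis", "disease"),
]
kept = filter_triples(candidates)
print(kept)
```

A set of tuples keeps the check to a single membership test; a fuller version would presumably key the relations by their RO identifiers rather than by label.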

miRNA molecules are small non-coding RNAs that play a central role in gene expression via interference pathways, and their misregulation is associated with several diseases [21]. miRNA molecules can generically interact with genes but can also, more precisely, regulate the activity of a gene when a miRNA molecule blocks the translation of a gene or promotes the degradation of the gene’s product. Moreover, miRNA molecules can regulate the activity of other miRNAs because they form base-pairing interactions with complementary miRNA molecules according to [22, 23]. The schema also contains the relationships involving genes and proteins. Specifically, the has gene product relation and its inverse gene product of are used for representing that different proteins are translated from the same gene (i.e., isoforms), while the regulates activity of relation is used for representing that a subclass of the proteins (transcription factors) regulates the activity of genes, promoting or down-regulating their activity by acting as enhancers or repressors. Both proteins and miRNAs are connected to the disease class by the causes or contributes to condition relation. The diagram also contains the main properties that can be associated with these bio-entities (e.g., nucleotide/amino acid sequences, descriptions of molecules/diseases, synonyms).

The proposed UML class diagram was translated into a LinkML schema. Genes are annotated using HGNC [24] IDs. This choice is motivated by the stability of the HGNC IDs even if a gene name or symbol changes. Proteins are grounded to the PRotein Ontology (PRO), while diseases are grounded to both the Monarch Disease Ontology (Mondo [25]) and the Human Phenotype Ontology (HPO [26]). miRNAs were left with no semantic annotation since neither miRNA labels (e.g., hsa-let-7b-5p) nor miRBase [27] accession identifiers (e.g., MIMAT0000063) correspond to CURIE prefixes included in the default SPIRES annotators. We can manually retrieve miRNA molecules from relationships extracted from SPIRES since their labels follow a pattern (for instance, the “hsa-” prefix indicates human miRNAs, the “mmu-” prefix murine miRNAs, and mature miRNAs are designated with the “miR-” substring, whilst “mir” refers to the stem-loop primary transcript). Labels can then be easily translated into miRBase accession identifiers using a look-up table.

Example 2. A LinkML class used to specify causes or contributes to condition relationships between proteins and diseases is reported in Listing 1. In the expression, we have to specify the need to extract triples representing relationships between proteins and diseases in which the only admitted predicate is causes or contributes to condition (RO:0003302). In the expression, samples of the kinds of relationships that we wish to extract are reported. The prompt generated for this class relies on the prompts generated for the classes protein and disease and used for the identification of these bio-entities from the scientific literature. Figure 3 shows an output obtained by using SPIRES and the corresponding result obtained by the simple application of ChatGPT. In the SPIRES output, the extracted interactions are already represented as triples that exploit the required identification scheme. Therefore, checking their presence in RNA-KG and, in case of new triples, their integration is facilitated.

[Listing 1: LinkML template for protein-disease interaction.]

4. Experimental results

In this section we discuss the experiments that we carried out to evaluate SPIRES for extracting interactions involving RNA molecules. Moreover, we compare SPIRES with ChatGPT (ver. GPT-3.5-turbo), which is the LLM internally integrated in SPIRES, and with Llama 2 (ver. llama-2-70b-chat), another well-known and widely used LLM.

4.1. Corpus of Annotated Documents

To evaluate the extraction of relations aligned with the meta-graph depicted in Figure 2, we manually selected a corpus of 60 scientific articles gathered from PubMed, ResearchGate, and Google Scholar by specifying keyword-based queries like: “disease”, “comorbidity”, “protein”, “miRNA”, “miRNA regulation”, “gene”. From these documents, we identified paragraphs containing useful information to be extracted (e.g., abstract, discussion, or specific subsections within the domain of interest). In the identification of the paragraphs we have taken into account the following guidelines: i) the paragraph should contain different kinds of relations between bio-entities (e.g., “miRNA-interacts with-gene” and “miRNA-regulates activity of-gene”) to evaluate the ability of SPIRES to identify the right relations according to the provided meta-graph; ii) the paragraph might also contain irrelevant relationships that should be discarded; iii) different identification schemes can be used in the paragraph to check the ability of SPIRES to correctly work with them. Paragraphs have been classified according to the kind of bio-entities that they describe and associated with the list of relationships that should be identified according to the adopted meta-graph. For each kind of bio-entity, the following table shows the number of paragraphs containing relationships involving it (note that a paragraph can contain more than one).

Protein   Disease   miRNA   Gene
44        58        37      21

In the considered paragraphs, we have identified six kinds of interactions among the considered bio-entities (reported in the y-axis of the diagram in Figure 4).

[Figure 4: Diagram whose y-axis reports the six kinds of interactions: miRNA-disease, miRNA-miRNA, gene-protein, miRNA-gene, protein-disease, disease-disease.]

4.2. Accuracy of Interactions extraction

For evaluating the obtained predictions, we have used standard metrics (precision, recall, and F-score) by considering the True Positive (TP), False Positive (FP), and False Negative (FN) according to the manually tagged paragraphs. Table 1 reports the obtained results for the considered interactions ordered according to the F-score measure. The obtained results indicate a consistent trend where recall tends to be lower than precision due to the prevalence of false negatives over false positives. We think this behavior is due to the difficulty in accurately extracting precise relationships from text, especially in distinguishing specific types of relationships. Furthermore, we observe that disease-disease and miRNA-disease interactions present a high F-score. These kinds of interactions are widely studied in the literature and thus a higher number of publications are available with respect to other interactions (like miRNA-miRNA interactions). Consequently, the abundance of this kind of relationship contributes to a higher true positive rate. Conversely, the F-score for protein-disease relations is notably low because it is influenced by low recall. We noticed that many protein-disease relations are undetected, often because they are expressed in complex ways within the text. For instance, symbols like “/” and “,” are used interchangeably (e.g., “overexpressions in IL6/MEGF8/RELA, and also TP53 are known to cause osteoporosis”). Additionally, mapping proteins to the PRO proves challenging when textual information is sparse or ambiguously expressed. For instance, the mention of “PMP-22” solely as “myelin protein 22” instead of “peripheral myelin protein 22” (due to assumptions made by authors) can lead to inaccurate grounding. Despite this, precision remains remarkably high and, in the biomedicine context, this is preferable because it prioritizes certainty over ambiguity.

We also compared our results with the average results achieved by SPIRES in other domains. A marginal improvement has been observed with respect to the domain of named entity recognition for chemicals and diseases [2]. We believe that the slightly enhanced accuracy is due to the use of multiple ontology annotators such as PRO for proteins, Mondo and HPO for diseases, and RO for relations.

4.3. Comparison with other LLMs

For assessing the performance of SPIRES with respect to ChatGPT and Llama 2, we focused on a subset of 20
documents where we manually grounded instances and relationships of the extracted triples. For using ChatGPT and Llama 2 we have generated prompts that adhere to the following pattern:

    extract triples in the form "subject-relation-object" within this document: [...]

This prompt does not guarantee that the identifiers for the subject and the object of the triples are obtained. However, if we try to generate a further prompt with the explicit request of mapping the extracted concepts to appropriate terminologies, both ChatGPT and Llama 2 advise that the provided ontology identifiers are hypothetical and may not correspond to actual ontology identifiers (so, hallucinations can occur in this case). Therefore, we decided to substitute the grounding process with our manually curated look-up tables [1].

When using ChatGPT (or Llama 2) alone, we do not have to specify the schema, and results are produced through a single interaction with the user. Avoiding the specification of the schema might be interpreted as an advantage of basic LLM approaches, but it is not. Indeed, the schema allows us to reduce the relationships to be extracted to only meaningful ones in the considered domain. Finally, no look-up table can be exploited for translating class instance names into the corresponding identifiers in the bio-ontologies (thus requiring a manual identification of the identifiers). All these drawbacks are avoided by the use of SPIRES.

As shown in the bottom part of Figure 5, SPIRES outperforms ChatGPT or Llama 2 alone both in terms of precision and recall. The histogram in Figure 5 points out a high increment in TP rate and a marked decrease in FP and FN rates when adopting SPIRES instead of ChatGPT or Llama 2 alone for extracting relations that adhere to a specified schema within texts.

[Figure 5: Comparison of SPIRES with ChatGPT and Llama 2 alone: histogram of TP, FP, and FN rates, with precision and recall reported in the bottom part.]

5. Concluding remarks

In this paper, we have reported the initial experimentation of the use of SPIRES for extracting triples from the scientific literature related to RNA molecules by taking advantage of the meta-graph we have built for the generation of RNA-KG. Even if a more systematic analysis is required, the initial results are quite encouraging. To facilitate the reproducibility of our results, our dataset and the LinkML template can be downloaded from: https://doi.org/10.5281/zenodo.10671796.

As future work, we would like to extend the approach by integrating the entire RNA-KG in different ways. First, we will exploit the RNA-KG triples for enhancing the prompts generated by SPIRES. Moreover, RNA-KG can be used for validating the plausibility of the generated triples by using RNA-KG as a gold standard in the area. Furthermore, we will explore the KG-enhanced LLM inference approaches in combination with SPIRES for further improving the precision of the system by injecting knowledge extracted from RNA-KG at inference time. Finally, we would like to create a web environment for graphically showing to the user the predicted triples directly in the graphical representation of the portion of the knowledge graph that will contain them. The user can thus manually check the proposed triples and provide feedback that will be handled afterward to improve the quality of the predictions.

Acknowledgements

This research was in part supported by the “National Center for Gene Therapy and Drugs based on RNA Technology”, PNRR-NextGeneration EU program [G43C22001320007] and in part by the MUSA - Multilayered Urban Sustainability Action - Project, funded by the PNRR-NextGeneration EU program ([G43C22001370007], Code ECS00000037).

References

[1] E. Cavalleri, et al., RNA-KG: An ontology-based knowledge graph for representing interactions involving RNA molecules, 2023. arXiv:2312.00183.
[2] J. H. Caufield, et al., Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning, Bioinformatics (2024). doi:10.1093/bioinformatics/btae104.
[3] R. Bommasani, et al., On the opportunities and risks of foundation models, CoRR abs/2108.07258 (2021).
[4] S. Moxon, et al., The Linked Data Modeling Language (LinkML): A General-Purpose Data Modeling Framework Grounded in Machine-Readable Semantics, in: Int'l Conf. on Biomedical Ontologies, 2021, pp. 148–151.
[5] OpenAI, GPT-4 technical report, 2023. arXiv:2303.08774.
[6] H. Touvron, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[7] A. Q. Jiang, et al., Mistral 7B, 2023. arXiv:2310.06825.
[8] L. Tunstall, et al., Zephyr: Direct Distillation of LM Alignment, 2023. arXiv:2310.16944.
[9] S. Pan, et al., Unifying large language models and knowledge graphs: A roadmap, 2023. arXiv:2306.08302.
[10] S. Ateia, U. Kruschwitz, Is ChatGPT a Biomedical Expert? – Exploring the Zero-Shot Performance of Current GPT Models in Biomedical Tasks, 2023. arXiv:2306.16108.
[11] Z. Ji, et al., Survey of hallucination in natural language generation, ACM Computing Surveys 55 (2023) 1–38. doi:10.1145/3571730.
[12] A. Ettinger, What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models, Transactions of the Association for Computational Linguistics 8 (2020) 34–48. doi:10.1162/tacl_a_00298.
[13] D. A. Natale, et al., The protein ontology: a structured representation of protein forms and complexes, Nucleic Acids Research 39 (2010). doi:10.1093/nar/gkq907.
[14] C. Mungall, et al., oborel/obo-relations: 2023-08-18 release, 2023. doi:10.5281/zenodo.8263469.
[15] Z. Zhang, et al., ERNIE: Enhanced language representation with informative entities, in: Proc. of Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1441–1451. doi:10.18653/v1/P19-1139.
[16] C. Rosset, et al., Knowledge-aware language model pretraining, CoRR (2020). arXiv:2007.00655.
[17] P. Lewis, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, in: Proc. of the 34th Int'l Conf. on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA, 2020.
[18] M. Danilevsky, et al., A survey of the state of explainable AI for natural language processing, in: Proc. of Int'l Conf. on Natural Language Processing, 2020, pp. 447–459.
[19] B. Y. Lin, X. Chen, J. Chen, X. Ren, KagNet: Knowledge-aware graph networks for commonsense reasoning, 2019. arXiv:1909.02151.
[20] E. Cavalleri, et al., A meta-graph for the construction of an RNA-centered knowledge graph, in: Bioinformatics and Biomedical Engineering, Springer, 2023, pp. 165–180. doi:10.1007/978-3-031-34953-9_13.
[21] G. J. Hannon, RNA interference, Nature 418 (2002) 244–251. doi:10.1038/418244a.
[22] L. Guo, et al., miRNA–miRNA interaction implicates for potential mutual regulatory pattern, Gene 511 (2012) 187–194. doi:10.1016/j.gene.2012.09.066.
[23] E. C. Lai, et al., Complementary miRNA pairs suggest a regulatory role for miRNA:miRNA duplexes, RNA 10 (2004) 171–175. doi:10.1261/rna.5191904.
[24] R. L. Seal, et al., Genenames.org: the HGNC resources in 2023, Nucleic Acids Research 51 (2022) D1003–D1009. doi:10.1093/nar/gkac888.
[25] N. A. Vasilevsky, et al., Mondo: Unifying diseases for the world, by the world, medRxiv (2022). doi:10.1101/2022.04.13.22273750.
[26] P. N. Robinson, et al., The human phenotype ontology: A tool for annotating and analyzing human hereditary disease, The American Journal of Human Genetics 83 (2008) 610–615. doi:10.1016/j.ajhg.2008.09.017.
[27] A. Kozomara, et al., miRBase: from microRNA sequences to function, Nucleic Acids Research 47 (2018) D155–D162. doi:10.1093/nar/gky1141.