=Paper= {{Paper |id=Vol-3651/DARLI-AP_paper6 |storemode=property |title=On the Extraction of Meaningful RNA Interactions from Scientific Publications through LLMs and SPIRES |pdfUrl=https://ceur-ws.org/Vol-3651/DARLI-AP-6.pdf |volume=Vol-3651 |authors=Emanuele Cavalleri,Marco Mesiti |dblpUrl=https://dblp.org/rec/conf/edbt/CavalleriM24 }} ==On the Extraction of Meaningful RNA Interactions from Scientific Publications through LLMs and SPIRES== https://ceur-ws.org/Vol-3651/DARLI-AP-6.pdf
On the extraction of meaningful RNA interactions from Scientific Publications through LLMs and SPIRES
Emanuele Cavalleri, Marco Mesiti
AnacletoLab - Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, Milano


Abstract
Knowledge graphs (KGs) are useful tools to uniformly represent and integrate heterogeneous information about a domain of interest. However, they are inherently incomplete; therefore, new facts should be introduced by extracting them from structured and unstructured data sources. Starting from RNA-KG, the first KG tailored for representing different kinds of RNA molecules that we recently developed, in this paper we evaluate the use of SPIRES for extracting interactions among bio-entities involving RNA molecules from scientific papers guided by the RNA-KG schema. SPIRES is a general-purpose knowledge extraction system for mining information conforming to a specified schema. A customized prompt is generated and submitted to a Large Language Model (LLM) along with a text to extract a set of RDF triples adhering to the schema constraints. The experiments show a high accuracy in extracting interactions from the scientific literature.

Keywords
RNA-based technologies, Knowledge Graphs, RNA-drug discovery, Large Language Models



1. Introduction

The “RNA world” represents a novel frontier for the study of fundamental biological processes and human diseases and is paving the way for the development of new drugs tailored to the patient’s biomolecular characteristics. Although scientific data about coding and non-coding RNA molecules are continuously produced and made available from public repositories, they are scattered across different databases and in the scientific literature. A centralized, uniform, and semantically consistent representation of the knowledge on RNA is still lacking. We have recently constructed RNA-KG [1], a knowledge graph integrating biological knowledge about RNA molecules with their functional relationships with genes, proteins, chemicals, and biomedical ontological concepts. RNA-KG includes around 600K nodes and 9M RDF triples representing reliable interactions involving RNA molecules and related biomedical concepts extracted from more than 50 public data sources according to 11 bio-ontologies. RNA-KG is coupled with a meta-graph representing all the possible interactions involving RNA molecules.

SPIRES (Structured Prompt Interrogation and Recursive Extraction of Semantics) [2] is a recently proposed approach to information extraction that exploits Large Language Models (LLMs) [3] to identify instances of a knowledge schema expressed in terms of LinkML [4] starting from plain texts. It adopts zero-shot learning to identify and extract relevant entities and relationships among them from an input text; these are then normalized and grounded through ontologies and vocabularies. SPIRES is a general-purpose approach that can be used across a variety of domains and does not require specific training/tuning on the considered domain. SPIRES adopts an engineering approach for creating prompts for interacting with an LLM (like GPT [5], Llama 2 [6], Mistral [7], and Zephyr [8]) to improve the quality of the generated responses [9]. In this way, technical challenges for generative AI (e.g., constructing comprehensive real-world knowledge and improving the accuracy of automated responses) can be addressed.

In this paper, we discuss the initial experimental results that we obtained by applying SPIRES in the extraction of interactions among bio-entities involving RNA molecules in the context of the PNRR project “Gene Therapy and Drugs based on RNA Technology”. The purpose of the experiments is to show the level of accuracy of the system in extracting interactions from the scientific literature and to investigate the possibility of combining RNA-KG with LLMs. Note that the extraction of interactions involving RNA molecules is particularly challenging for two reasons. First, a well-recognized ontology for characterizing non-coding RNA molecules is still lacking; second, different identifiers are adopted for representing the same bio-entity. Even if a more systematic evaluation should be conducted, the initial results are very encouraging.

The paper is structured as follows. Section 2 describes the SPIRES approach and related approaches that integrate LLMs with knowledge data. Section 3 presents the LinkML schema that we have developed for interacting with SPIRES. Section 4 describes the experimental results, while Section 5 reports concluding remarks.

Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy
emanuele.cavalleri@unimi.it (E. Cavalleri); marco.mesiti@unimi.it (M. Mesiti)
ORCID: 0000-0003-1973-5712 (E. Cavalleri); 0000-0002-9421-8566 (M. Mesiti)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073



Emanuele Cavalleri et al. CEUR Workshop Proceedings                                                                                                                    1–6

[Figure 1 placeholder: workflow diagram with the panels Schema, Text, Prompt, LLM, GPT results, Bio-ontologies and DBs (PRO, HGNC, ...), and Ground results conform to schema. The Schema panel shows a LinkML class Protein with the attribute label (description: the name of the protein) and the annotation annotators: sqlite:obo:pr. The Text panel shows "COX20 is essential for the assembly of the mitochondrial respiratory chain complex IV (CIV)". The Prompt panel reads "From the text below, extract the following entities in the following format: protein:". The GPT results panel shows raw_completion_output: 'protein: COX20'. The grounded result shows extracted_object: protein: PR:000030199 and named_entities: - id: PR:000030199. The diagram also carries the label "nested structures" next to the LLM step.]

Figure 1: SPIRES workflow.
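The workflow of Figure 1 can be approximated in a few lines of Python. This is only a minimal sketch under stated assumptions, not the SPIRES implementation: the dictionary `SCHEMA` is a simplified stand-in for a LinkML class, `fake_llm` is a hypothetical placeholder that returns the completion shown in the figure instead of calling a real model, and `PRO_LOOKUP` is a toy table replacing the ontology annotator.

```python
# Simplified stand-in for the LinkML class of Figure 1 (assumption: a plain dict).
SCHEMA = {
    "Protein": {
        "attributes": {"label": "the name of the protein"},
        "annotators": "sqlite:obo:pr",
    }
}

# Toy grounding table; the only entry is the mapping shown in Figure 1.
PRO_LOOKUP = {"COX20": "PR:000030199"}


def build_prompt(class_name, text):
    """Build an extraction prompt with one 'field: <description>' line."""
    description = SCHEMA[class_name]["attributes"]["label"]
    return (
        "From the text below, extract the following entities "
        "in the following format:\n"
        f"{class_name.lower()}: <{description}>\n\n"
        f"Text:\n{text}"
    )


def fake_llm(prompt):
    """Hypothetical LLM stub returning the raw completion of Figure 1."""
    return "protein: COX20"


def parse_completion(raw):
    """Parse 'key: value' lines of a raw completion into a dictionary."""
    parsed = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            parsed[key.strip()] = value.strip()
    return parsed


def ground(parsed, lookup):
    """Replace labels with ontology identifiers; keep unknown labels unchanged."""
    return {key: lookup.get(value, value) for key, value in parsed.items()}


text = ("COX20 is essential for the assembly of the mitochondrial "
        "respiratory chain complex IV (CIV)")
prompt = build_prompt("Protein", text)
extracted = ground(parse_completion(fake_llm(prompt)), PRO_LOOKUP)
print(extracted)  # {'protein': 'PR:000030199'}
```

Keeping prompt construction, parsing, and grounding as separate functions mirrors the separate panels of the figure and makes it easy to replace the stub with a real LLM call.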



2. SPIRES and Related Work

The population of a KG by extracting triples from unstructured texts is an active research area, and the advent of LLMs has boosted the interpretation of highly technical languages, as shown on question-answering benchmarks [10]. However, these techniques have shown several limitations, such as generating incorrect statements due to hallucinations [11] and insensitivity to negations [12], that cannot be tolerated in sensitive domains like precision medicine. To reduce these drawbacks, SPIRES adopts: 𝑖) the knowledge schema of a specific domain for the generation of prompts; and 𝑖𝑖) bio-ontologies for enhancing the quality of the produced information.

Figure 1 outlines the SPIRES workflow. SPIRES requires the specification of the knowledge schema expressed in LinkML [4] to guide the system in the extraction of knowledge. A LinkML schema contains the classes of entities and relationships among them within the specified domain. Classes can also include attributes (e.g., name, type, and list of synonyms) to enrich entity description. The LinkML schema is automatically processed to generate a list of prompts through which SPIRES interacts with an LLM (e.g., GPT-3, GPT-4, Llama 2, Mistral, and Zephyr). Each prompt of the list is submitted to the LLM, and the collected information is used to complete the following prompt, possibly consulting the bio-ontologies (e.g., for replacing a protein symbol with the corresponding identifier in an ontology). This recursive refinement process improves the quality of the information gathered through the LLM.

Example 1. Suppose we wish to extract proteins from a text. A LinkML expression can be generated for describing the class Protein with its properties and the adopted identification scheme (see Figure 1). A prompt is then generated for this class and used for extracting proteins. However, the result obtained by ChatGPT alone (in this case COX20) is not compliant with the Protein class structure. Therefore, SPIRES exploits bio-ontologies (e.g., PRotein Ontology, PRO [13]) to obtain an adequate result.

Furthermore, in case relationships are identified, SPIRES selectively retains only those aligned with the predefined schema that can be grounded to the Relations Ontology (RO [14]). By exploiting standard identification schemes adopted by the reference bio-ontologies, the system guarantees the generation of triples that can be easily integrated into a biomedical KG.

SPIRES thus creates and refines prompts to maximize the effectiveness of LLMs by exploiting domain knowledge encapsulated through the description of the classes and relationships that we wish to include in the KG.

As outlined in [9], the explicit and structured information contained in KGs can also be used for improving the knowledge awareness of LLMs. KGs have been used: 𝑖) in the training of the LLM [15, 16]; 𝑖𝑖) during the inference stage for making available to the LLMs the latest knowledge without retraining [17]; 𝑖𝑖𝑖) to improve the interpretability of LLMs by explaining the facts [18] and by enhancing the reasoning process of LLMs [19]. One of the main disadvantages of solution 𝑖) is that any update of the knowledge contained in the KG requires retraining the model, which is a time (and money) consuming activity. For this reason, approaches of solution 𝑖𝑖) are gaining momentum because they allow the separation of the text space and the knowledge space. In this case, knowledge is injected at the time of inference.

3. The SPIRES Schema for RNA-KG

For the creation of the schema needed for the application of SPIRES, we considered the RNA-KG meta-graph [20] that represents all the kinds of relationships involving RNA molecules in the considered data sources. Starting from it, a UML class diagram was developed that formally describes the schema of the considered domain and can be used for identifying meaningful relationships in the considered domain. Figure 2 shows an excerpt of the generated UML class diagram that consists of four biological and biomedical classes (miRNA, gene, protein, and disease) with six kinds of RO relationships.






[Figure 2 placeholder: UML class diagram with four classes and the relationships among them.
Classes:
  miRNA: id (String), description (String), sequence (String), family name (String)
  gene: id (Integer), type (String), symbol (String)
  protein: id (String), description (String), synonym (String list), ortholog (String list), sequence (String)
  disease: id (String), description (String), synonym (String list)
Relationships (multiplicity n to n unless stated otherwise):
  miRNA regulates activity of miRNA
  miRNA regulates activity of gene
  miRNA interacts with gene
  gene has gene product protein, with inverse protein gene product of gene (multiplicities 1 and n)
  protein regulates activity of gene
  disease causally related to disease
  miRNA causes or contributes to condition disease
  protein causes or contributes to condition disease]

Figure 2: Meta-graph of test to evaluate the capabilities of SPIRES.
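To illustrate how a meta-graph like the one in Figure 2 can be used to retain only schema-compliant relationships, the following sketch encodes the allowed (subject class, predicate, object class) combinations as a set and filters candidate triples against it. The data structure, the function names, and the second candidate triple are assumptions made for illustration; they are not taken from RNA-KG or from the SPIRES code base.

```python
# Allowed class-level signatures, transcribed from the relationships
# described for the meta-graph of Figure 2 (representation is an assumption).
META_GRAPH = {
    ("miRNA", "regulates activity of", "miRNA"),
    ("miRNA", "regulates activity of", "gene"),
    ("miRNA", "interacts with", "gene"),
    ("gene", "has gene product", "protein"),
    ("protein", "gene product of", "gene"),
    ("protein", "regulates activity of", "gene"),
    ("disease", "causally related to", "disease"),
    ("miRNA", "causes or contributes to condition", "disease"),
    ("protein", "causes or contributes to condition", "disease"),
}


def is_allowed(subject_class, predicate, object_class):
    """Return True if the class-level signature appears in the meta-graph."""
    return (subject_class, predicate, object_class) in META_GRAPH


def filter_triples(candidates):
    """Keep only the candidate triples whose signature is allowed."""
    return [
        c for c in candidates
        if is_allowed(c["subject_class"], c["predicate"], c["object_class"])
    ]


candidates = [
    # Sample relationship of the kind used in the prompt of Listing 1.
    {"subject": "DNMT1", "subject_class": "protein",
     "predicate": "causes or contributes to condition",
     "object": "Alzheimer disease", "object_class": "disease"},
    # Hypothetical candidate that the meta-graph does not allow.
    {"subject": "disease A", "subject_class": "disease",
     "predicate": "interacts with",
     "object": "gene B", "object_class": "gene"},
]

kept = filter_triples(candidates)
print([c["subject"] for c in kept])  # ['DNMT1']
```

The set contains nine signatures built from six distinct predicates, which matches the six kinds of RO relationships mentioned for the diagram.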



miRNA molecules are small non-coding RNAs that play a central role in gene expression via interference pathways, and their misregulation is associated with several diseases [21]. miRNA molecules can generically interact with genes, but can also more precisely regulate the activity of a gene when a miRNA molecule blocks the translation of a gene or promotes the degradation of the gene’s product. Moreover, miRNA molecules can regulate the activity of other miRNAs because they form base-pairing interactions with complementary miRNA molecules according to [22, 23]. The schema also contains the relationships involving genes and proteins. Specifically, the has gene product relation and its inverse gene product of are used for representing that different proteins are translated from the same gene (i.e., isoforms), while the regulates activity of relation is used for representing that a subclass of the proteins (transcription factors) regulates the activity of genes, promoting or down-regulating their activity by acting as enhancers or repressors. Both proteins and miRNAs are connected to the disease class by the causes or contributes to condition relation. The diagram also contains the main properties that can be associated with these bio-entities (e.g., nucleotide/amino acid sequences, descriptions of molecules/diseases, synonyms).

The proposed UML class diagram was translated into a LinkML schema. Genes are annotated using HGNC [24] IDs. This choice is motivated by the stability of the HGNC IDs even if a gene name or symbol changes. Proteins are grounded to the PRotein Ontology (PRO), while diseases are grounded to both the Monarch Disease Ontology (Mondo [25]) and the Human Phenotype Ontology (HPO [26]). miRNAs were left with no semantic annotation since miRNA labels (e.g., hsa-let-7b-5p) and miRBase [27] accession identifiers (MIMAT0000063) use CURIE prefixes that are not included in the default SPIRES annotators. We can manually retrieve miRNA molecules from the relationships extracted by SPIRES since their labels follow a pattern (for instance, the “hsa-” prefix indicates human miRNAs, the “mmu-” prefix indicates murine miRNAs, and mature miRNAs are designated with the “miR-” substring whilst “mir” refers to the stem-loop primary transcript). Labels can then be easily translated into miRBase accession identifiers using a look-up table.

Example 2. A LinkML class used to specify causes or contributes to condition relationships between proteins and diseases is reported in Listing 1. In the expression, we have to specify the need to extract triples representing relationships between proteins and diseases in which the only admitted predicate is causes or contributes to condition (RO:0003302). The expression also reports samples of the kinds of relationships that we wish to extract. The prompt generated for this class relies on the prompts generated for the classes protein and disease, which are used for the identification of these bio-entities in the scientific literature. Figure 3 shows an output obtained by using SPIRES and the corresponding result obtained by the simple application of ChatGPT. In the SPIRES output, the extracted interactions are already represented as triples that exploit the required identification scheme. Therefore, checking their presence in RNA-KG and, in case of new triples, their integration is facilitated.

Listing 1: LinkML template for protein-disease interaction.
ProteinDiseaseInteraction:
  description: A document that contains protein to
    disease relationships.
  is_a: TextWithTriples
  slot_usage:
    triples:
      range: ProteinToDiseaseRelationship
      annotations:
        prompt: >-
          A semi-colon separated list of protein to
          disease relationships. The relationship
          is "causes or contributes to condition".
          For example:
          DNMT1 causes or contributes to condition
          Alzheimer disease;
          HOXA1 causes or contributes to condition
          Alzheimer disease.

Figure 3: Example of output for SPIRES and ChatGPT.

4. Experimental results

In this section we discuss the experiments that we carried out to evaluate SPIRES for extracting interactions involving RNA molecules. Moreover, we compare SPIRES with ChatGPT (ver. GPT-3.5-turbo), which is the LLM internally integrated in SPIRES, and with Llama 2 (ver. llama-2-70b-chat), another well-known and widely used LLM.

4.1. Corpus of Annotated Documents

To evaluate the extraction of relations aligned with the meta-graph depicted in Figure 2, we manually selected a corpus of 60 scientific articles gathered from PubMed, ResearchGate, and Google Scholar by specifying keyword-based queries like: “disease”, “comorbidity”, “protein”, “miRNA”, “miRNA regulation”, “gene”. From these documents, we identified paragraphs containing useful information to be extracted (e.g., abstract, discussion, or specific subsections within the domain of interest). In the identification of the paragraphs we have taken into account the following guidelines: 𝑖) the paragraph should contain different kinds of relations between bio-entities (e.g., “miRNA-interacts with-gene” and “miRNA-regulates activity of-gene”) to evaluate the ability of SPIRES to identify the right relations according to the provided meta-graph; 𝑖𝑖) the paragraph might also contain irrelevant relationships that should be discarded; 𝑖𝑖𝑖) different identification schemes can be used in the paragraph to check the ability of SPIRES to correctly work with them. Paragraphs have been classified according to the kind of bio-entities that they describe and associated with the list of relationships that should be identified according to the adopted meta-graph. For each kind of bio-entity, the following table shows the number of paragraphs containing relationships involving it (note that a paragraph can contain more than one).

        Protein     Disease      miRNA       Gene
          44          58           37         21

In the considered paragraphs, we have identified six kinds of interactions among the considered bio-entities (reported in the y-axis of the diagram in Figure 4).

4.2. Accuracy of Interactions extraction

For evaluating the obtained predictions, we have used standard metrics (precision, recall, and F-score) by considering the True Positive (TP), False Positive (FP), and False Negative (FN) according to the manually tagged paragraphs. Table 1 reports the obtained results for the considered interactions ordered according to the F-score measure. The obtained results indicate a consistent trend where recall tends to be lower than precision due to the prevalence of false negatives over false positives. We think this behavior is due to the difficulty in accurately extracting precise relationships from text, especially in distinguishing specific types of relationships. Furthermore, we observe that disease-disease and miRNA-disease interactions present a high F-score. These kinds of interactions are widely studied in the literature and thus a higher number of publications are available with respect to other interactions (like miRNA-miRNA interactions). Consequently, the abundance of these kinds of relationships contributes to a higher true positive rate. Conversely, the F-score for protein-disease relations is notably low because it is influenced by low recall. We noticed that many protein-disease relations are undetected, often because they are expressed in complex ways within the text. One example is the interchangeable use of symbols like “/” and “,” (e.g., “overexpressions in IL6/MEGF8/RELA, and also TP53 are known to cause osteoporosis”). Additionally, mapping proteins to the PRO proves challenging when textual information is sparse or ambiguously expressed. For instance, the mention of “PMP-22” solely as “myelin protein 22” instead of “peripheral myelin protein 22” (due to assumptions made by authors) can lead to inaccurate grounding. Despite this, precision remains remarkably high and, in the biomedicine context, this is preferable because it prioritizes certainty over ambiguity.

We also compared our results with the average results achieved by SPIRES in other domains. We observed a marginal improvement over the results reported in the domain of named entity recognition for chemicals and diseases [2]. We believe that the slightly enhanced accuracy is due to the use of multiple ontology annotators such as PRO for proteins, Mondo and HPO for diseases, and RO for relations.

4.3. Comparison with other LLMs

For assessing the performance of SPIRES with respect to ChatGPT and Llama 2, we focused on a subset of 20





                      # Paragraphs    TP     FP     FN    F-score   Precision   Recall
   disease-disease         16          54      5     10     0.88       0.92       0.84
   miRNA-disease           32         123     20     31     0.82       0.86       0.80
   miRNA-miRNA              1          19      1      7     0.82       0.95       0.73
   gene-protein            10          52      5     21     0.80       0.91       0.71
   miRNA-gene              13          14      3      5     0.78       0.82       0.74
   protein-disease         24          42      7     60     0.56       0.86       0.41
   Total               (60 texts)     304     41    134     0.76       0.88       0.69
   Table 1
   Results for named entity recognition evaluation of SPIRES on relations involving protein, miRNA, disease, and gene entities.
   Grounding was performed against HGNC, PRO, Mondo, HPO, and RO.
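
   The metric columns of Table 1 can be recomputed from the TP, FP, and FN counts. The following is a minimal sketch, assuming that the F-score is the balanced F1 (harmonic mean of precision and recall), which the text does not state explicitly; the names `Counts`, `TABLE_1_ROWS`, and `summarize` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """TP/FP/FN counts for one interaction kind (hypothetical helper)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives

    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    def f_score(self) -> float:
        # Assumption: balanced F1, i.e. 2*TP / (2*TP + FP + FN).
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0


# Counts copied from two rows of Table 1.
TABLE_1_ROWS = {
    "disease-disease": Counts(tp=54, fp=5, fn=10),
    "protein-disease": Counts(tp=42, fp=7, fn=60),
}


def summarize(rows: dict[str, Counts]) -> dict[str, tuple[float, float, float]]:
    """Return (F-score, precision, recall) per row, rounded to two decimals."""
    return {
        name: (round(c.f_score(), 2), round(c.precision(), 2), round(c.recall(), 2))
        for name, c in rows.items()
    }


if __name__ == "__main__":
    for name, (f, p, r) in summarize(TABLE_1_ROWS).items():
        print(f"{name}: F-score={f:.2f} precision={p:.2f} recall={r:.2f}")
```

   Under this assumption, the two rows reproduce the values reported in Table 1: (0.88, 0.92, 0.84) for disease-disease and (0.56, 0.86, 0.41) for protein-disease. The low recall of the latter follows directly from its 60 false negatives.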


   [Figure 4: horizontal bar chart. y-axis: interaction kind (disease-disease, miRNA-disease, miRNA-miRNA, gene-protein, miRNA-gene, protein-disease); x-axis: Rate (0 to 0.8); legend: TP, FP, FN.]
   Figure 4: TP, FP, and FN results for evaluation of SPIRES on relations involving protein, miRNA, disease, and gene entities.

   [Figure 5, top: histogram of TP, FP, and FN rates per system.]
                 TP      FP      FN
   SPIRES       0.77    0.05    0.18
   Llama 2      0.59    0.07    0.34
   ChatGPT      0.47    0.17    0.35

   [Figure 5, bottom]
                F-score   Precision   Recall
   SPIRES        0.86       0.94       0.81
   Llama 2       0.74       0.89       0.64
   ChatGPT       0.64       0.73       0.57
   Figure 5: SPIRES vs Llama 2 vs ChatGPT on 20 texts.
   documents where we manually grounded instances and relationships of the extracted triples. To use ChatGPT and Llama 2, we generated prompts that adhere to the following pattern:

            extract triples in the form
            " subject-relation-object "
            within this document: [...]

      This prompt does not guarantee that identifiers are returned for the subject and the object of the triples. However, when we issue a further prompt explicitly requesting that the extracted concepts be mapped to appropriate terminologies, both ChatGPT and Llama 2 warn that the provided ontology identifiers are hypothetical and may not correspond to actual ontology identifiers (so hallucinations can occur in this case). Therefore, we decided to replace the grounding process with our manually curated lookup tables [1].
      When using ChatGPT (or Llama 2) alone, we do not have to specify the schema, and results are produced through a single interaction with the user. Avoiding the specification of the schema might be interpreted as an advantage of basic LLM approaches, but it is not. Indeed, the schema allows us to restrict the relationships to be extracted to only those that are meaningful in the considered domain. Finally, no lookup table can be exploited for translating class instance names into the corresponding identifiers in the bio-ontologies (thus requiring manual identification of the identifiers). All these drawbacks are avoided by the use of SPIRES.
      As shown in the bottom part of Figure 5, SPIRES outperforms ChatGPT and Llama 2 alone in terms of both precision and recall. The histogram in Figure 5 shows a marked increase in the TP rate and a marked decrease in the FP and FN rates when adopting SPIRES instead of ChatGPT or Llama 2 alone for extracting relations that adhere to a specified schema within texts.

5. Concluding remarks

In this paper, we have reported the initial experimentation of the use of SPIRES for extracting triples from






the scientific literature related to RNA molecules by taking advantage of the meta-graph we have realized for the generation of RNA-KG. Even if a more systematic analysis is required, the initial results are quite encouraging. To facilitate the reproducibility of our results, our dataset and the LinkML template can be downloaded from: https://doi.org/10.5281/zenodo.10671796.
   As future work, we would like to extend the approach by integrating the entire RNA-KG in different ways. First, we will exploit the RNA-KG triples for enhancing the prompts generated by SPIRES. Moreover, RNA-KG can be used for validating the plausibility of the generated triples by using RNA-KG as a gold standard in the area. Furthermore, we will explore the KG-enhanced LLM inference approaches in combination with SPIRES for further improving the precision of the system by injecting knowledge extracted from RNA-KG at inference time. Finally, we would like to create a web environment for graphically showing to the user the predicted triples directly in the graphical representation of the portion of the knowledge graph that will contain them. The user can thus manually check the proposed triples and provide feedback that will be handled afterward to improve the quality of the predictions.

Acknowledgements

This research was in part supported by the “National Center for Gene Therapy and Drugs based on RNA Technology”, PNRR-NextGeneration EU program [G43C22001320007] and in part by the MUSA - Multilayered Urban Sustainability Action - Project, funded by the PNRR-NextGeneration EU program ([G43C22001370007], Code ECS00000037).

References

 [1] E. Cavalleri, et al., RNA-KG: An ontology-based knowledge graph for representing interactions involving RNA molecules, 2023. arXiv:2312.00183.
 [2] J. H. Caufield, et al., Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning, Bioinformatics (2024). doi:10.1093/bioinformatics/btae104.
 [3] R. Bommasani, et al., On the opportunities and risks of foundation models, CoRR abs/2108.07258 (2021).
 [4] S. Moxon, et al., The Linked Data Modeling Language (LinkML): A General-Purpose Data Modeling Framework Grounded in Machine-Readable Semantics, in: Int’l Conf. on Biomedical Ontologies, 2021, pp. 148–151.
 [5] OpenAI, GPT-4 tech. report, 2023. arXiv:2303.08774.
 [6] H. Touvron, et al., Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
 [7] A. Q. Jiang, et al., Mistral 7B, 2023. arXiv:2310.06825.
 [8] L. Tunstall, et al., Zephyr: Direct Distillation of LM Alignment, 2023. arXiv:2310.16944.
 [9] S. Pan, et al., Unifying large language models and knowledge graphs: A roadmap, 2023. arXiv:2306.08302.
[10] S. Ateia, U. Kruschwitz, Is ChatGPT a Biomedical Expert? – Exploring the Zero-Shot Performance of Current GPT Models in Biomedical Tasks, 2023. arXiv:2306.16108.
[11] Z. Ji, et al., Survey of hallucination in natural language generation, ACM Computing Surveys 55 (2023) 1–38. doi:10.1145/3571730.
[12] A. Ettinger, What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models, Transactions of the Association for Computational Linguistics 8 (2020) 34–48. doi:10.1162/tacl_a_00298.
[13] D. A. Natale, et al., The protein ontology: a structured representation of protein forms and complexes, Nucleic Acids Research 39 (2010). doi:10.1093/nar/gkq907.
[14] C. Mungall, et al., oborel/obo-relations: 2023-08-18 release, 2023. doi:10.5281/zenodo.8263469.
[15] Z. Zhang, et al., ERNIE: Enhanced language representation with informative entities, in: Proc. of Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1441–1451. doi:10.18653/v1/P19-1139.
[16] C. Rosset, et al., Knowledge-aware language model pretraining, CoRR (2020). arXiv:2007.00655.
[17] P. Lewis, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, in: Proc. of the 34th Int’l Conf. on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA, 2020.
[18] M. Danilevsky, et al., A survey of the state of explainable AI for natural language processing, in: Proc. of Int’l Conf. on Natural Language Processing, 2020, pp. 447–459.
[19] B. Y. Lin, X. Chen, J. Chen, X. Ren, KagNet: Knowledge-aware graph networks for commonsense reasoning, 2019. arXiv:1909.02151.
[20] E. Cavalleri, et al., A meta-graph for the construction of an RNA-centered knowledge graph, in: Bioinformatics and Biomedical Engineering, Springer, 2023, pp. 165–180. doi:10.1007/978-3-031-34953-9_13.
[21] G. J. Hannon, RNA interference, Nature 418 (2002) 244–251. doi:10.1038/418244a.
[22] L. Guo, et al., miRNA–miRNA interaction implicates for potential mutual regulatory pattern, Gene 511 (2012) 187–194. doi:10.1016/j.gene.2012.09.066.
[23] E. C. Lai, et al., Complementary miRNA pairs suggest a regulatory role for miRNA:miRNA duplexes, RNA 10 (2004) 171–175. doi:10.1261/rna.5191904.
[24] R. L. Seal, et al., Genenames.org: the HGNC resources in 2023, Nucleic Acids Research 51 (2022) D1003–D1009. doi:10.1093/nar/gkac888.
[25] N. A. Vasilevsky, et al., Mondo: Unifying diseases for the world, by the world, medRxiv (2022). doi:10.1101/2022.04.13.22273750.
[26] P. N. Robinson, et al., The human phenotype ontology: A tool for annotating and analyzing human hereditary disease, The American Journal of Human Genetics 83 (2008) 610–615. doi:10.1016/j.ajhg.2008.09.017.
[27] A. Kozomara, et al., miRBase: from microRNA sequences to function, Nucleic Acids Research 47 (2018) D155–D162. doi:10.1093/nar/gky1141.



