<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reshaping Biomedical Scientific Literature in a RAG Pipeline for Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maël Lesavourey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gilles Hubert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IRIT, Université de Toulouse</institution>
          ,
          <addr-line>118 route de Narbonne, 31062 Toulouse Cedex 9</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>57</fpage>
      <lpage>70</lpage>
      <abstract>
<p>Biomedical Question Answering (BQA) poses specific challenges due to the specialized vocabulary and complex semantic structures of biomedical literature. Large Language Models (LLMs) have shown great performance in several Natural Language Understanding and Generation tasks. However, their effectiveness tends to drop in domain-specific contexts such as biomedicine. Polysemy, complex lexical structures, and the need for precise and factual information exacerbate their limitations. To address these issues, Retrieval-Augmented Generation (RAG) pipelines have become a promising approach, combining the strengths of retrieval methods with LLMs to incorporate domain-specific knowledge into the generation process. In this article, we investigate the role of context in enhancing the performance of RAG pipelines for BQA. We show that incorporating a context grounded on proper literature reshaping positively affects the quality of generated answers, improving both semantic and lexical metrics. We also show that it has more effect on Precision than on Recall. This work underscores the importance of appropriately structuring the context to enhance the performance of LLMs and assist them in processing and selecting relevant information.</p>
      </abstract>
      <kwd-group>
<kwd>Retrieval-Augmented Generation</kwd>
        <kwd>Biomedical Question Answering</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Answer Generation</kwd>
        <kwd>Scientific Literature Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Since their release, language models (LMs) such as BERT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and GPT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] have gradually been adopted
for a wide range of tasks related to Natural Language Understanding (NLU) and Processing (NLP). Their
ability to understand the semantic relation between words in a document has reshaped traditional
approaches in various fields like Information Retrieval (IR), achieving State-Of-The-Art (SOTA) results
in multiple tasks, e.g., document ranking, classification, and text generation [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. However, those
models do not perform well when applied to domain-specific corpora like biomedical literature and legal
documents [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. The main reasons are the particular characteristics of those texts which amplify the
semantic gap between general knowledge and specialized concepts. Biomedical literature is composed
of complex lexical structures like chemical formulas, proper nouns, and abbreviations. Moreover, the
understanding of such literature is harder due to its polysemy, for example the expressions “Heart
Attack”, “Myocardial Infarction”, “Cardiovascular Stroke” having the same meaning1.
      </p>
      <p>
        Addressing these challenges in the context of Biomedical Question Answering (BQA) tasks requires
careful consideration of the domain’s specific characteristics. A wide range of BQA tasks exists [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], each
one having its own particularities regarding the content of the corpus, response format and targeted
audience. We consider scientific literature as our source of information, while the query and its answer
should target specialized readers, and be written in natural language. This task sits at the intersection
of IR and language generation.
      </p>
      <p>
        A first method to consider the specific characteristics of biomedical corpora has been to use LMs
pre-trained on such texts [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ]. However, several works [11, 12, 13] have shown that, despite
enhanced performance, such models still lack semantic understanding. With recent large LMs (LLMs),
this method is not an option because it would be highly expensive to train one model from scratch.
Therefore, several methods have been proposed to address the task of knowledge incorporation into
LLMs. Retrieval-Augmented Generation (RAG) [14] combines text generation with relevant document
retrieval mechanisms to contextualize responses. In a different way, In-Context Learning (ICL) [15]
aims at aligning the generated responses with the user’s expectations by providing examples directly
in the model inputs. However, the effectiveness of those approaches depends on which context
is extracted and how it is structured, e.g., examples of pairs (query, answer), plain text from scientific
publications, or semantic predications.
      </p>
      <p>We study in this paper how to properly incorporate domain-specific knowledge extracted from
scientific publications into LLMs in order to overcome their limitations and what is the impact of adding
such context for BQA.</p>
      <p>In the remainder of this article, we first present related works on RAG and BQA. Then, we describe
the method implemented to address this task, followed by a detailed presentation of the models and
technologies used for its implementation and evaluation. We then analyze the results before concluding
with a discussion on the implications of our approach and future research opportunities.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
<p>Our work is related to different domains, i.e., IR, LMs, and BQA, as introduced in the following sections.</p>
      <sec id="sec-2-1">
        <title>2.1. Information Retrieval</title>
        <p>First approaches in IR were based on lexical matching, using statistics measuring co-occurrences of
words between several texts (e.g., a document and a query). A well-known method, BM25, is based
on TF-IDF scoring and takes advantage of different concepts like term frequency, rareness, and text
length to compute a similarity score. The main limitation of these approaches lies in their inability to
take into account the semantic meaning of the text (e.g., use of synonyms or paraphrased terms). To this end, researchers
have shown interest in developing dense retrievers [16] that capture semantic relationships that go
beyond exact word matching.</p>
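<p>As an illustration, the BM25 scoring mentioned above can be sketched in a few lines. The following is a minimal, self-contained Python sketch of the classic Okapi BM25 formula with conventional default parameters (k1 = 1.5, b = 0.75); it is our illustration, not the implementation used later in the pipeline.</p>

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of `doc` for `query_terms` over a small corpus.

    `doc` and each element of `corpus` are lists of tokens.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        # Document frequency: in how many documents does the term occur?
        df = sum(1 for d in corpus if term in d)
        # Smoothed IDF: rare terms weigh more than frequent ones.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(term)
        # Term-frequency saturation (k1) and document-length normalization (b).
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
    return score

corpus = [
    "myocardial infarction is a heart attack".split(),
    "aspirin reduces fever and pain".split(),
    "the heart pumps blood".split(),
]
query = "heart attack".split()
ranked = sorted(corpus, key=lambda d: bm25_score(query, d, corpus), reverse=True)
```

<p>Here the document containing both query terms ranks first, while a document sharing no term scores zero, which is exactly the lexical-matching limitation discussed above.</p>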
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Language Models</title>
        <p>
          The transformer architecture was introduced in [17]. It is based on the self-attention mechanism that
enables the model to capture both local and global dependencies of a sequence of tokens. Two major families of LMs
have emerged: encoder and decoder based models. BERT [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] has been the most widely studied
encoder-based model since its release. Its pre-train-then-fine-tune paradigm led to significant improvements in
tasks such as text classification and named entity recognition.
        </p>
        <p>
At the same time, decoder-based models that focus on generating new tokens by predicting the next
word of a sequence have been developed. GPT-1 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] demonstrated that training on large corpora could
produce a generative model capable of handling several language comprehension and generation tasks.
GPT-2 [18] marked a breakthrough by considerably increasing the number of parameters of LLMs and
the size of the training corpus. Its ability to perform different tasks without fine-tuning also marked
a turning point, paving the way for ICL, which makes it possible to guide the behaviour of LLMs
without fine-tuning. More recently, the development of LLMs, including GPT-3 [19], LLaMA [20] and
Mixtral [21], has pushed the boundaries of what can be achieved with transformers, demonstrating
unprecedented capabilities in understanding and generating human-like text. They also highlighted the
limitations of LLMs in terms of biases (e.g., hallucinations [22]) and computational limitations due to
their size.
        </p>
        <p>RAG combines the strengths of retrieval methods and generative LLMs [14], bridging the gap between
IR and text generation. In this approach, a retriever selects relevant documents or passages based on
a query, and a generative model uses the retrieved information to produce a contextualised response
[23, 24]. By creating a dynamic and query-specific context, RAG enables LLMs to focus their attention
on the most relevant information, improving accuracy and reducing hallucinations [25]. This method
offers a powerful alternative to models based exclusively on static parameters.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Biomedical Q&amp;A</title>
        <p>Over the past twelve years, BioASQ evaluation campaigns [26] have enabled the development of various
methods for BQA, which followed the evolution of IR and Answer Generation. Early approaches focused
on extractive summarization techniques based on lexical matching, e.g., TF-IDF or LexRank. Over the
years, participating teams started to use supervised and deep learning methods which outperformed
previous works.</p>
        <p>
More recently, transfer learning has gained attention, with models pre-trained on
general-domain QA datasets and fine-tuned on the BioASQ dataset [
          <xref ref-type="bibr" rid="ref11">27</xref>
          ]. Another step forward
was the emergence of domain-specific LMs like PubMedBERT [
          <xref ref-type="bibr" rid="ref8">8</xref>
] which made it possible to effectively encode
biomedical entities and relational information.
        </p>
        <p>
With the emergence of LLMs, participating teams have naturally gained interest in RAG pipelines.
They use sparse [
          <xref ref-type="bibr" rid="ref12 ref13 ref14">28, 29, 30</xref>
          ] or hybrid [
          <xref ref-type="bibr" rid="ref15 ref16">31, 32</xref>
          ] methods for the retrieval part. Most of them employ a
re-ranking module to select more relevant articles. For answer generation, the proposed approaches
explore different context creation strategies (e.g., ICL, snippets extraction) and different model tuning
options (e.g., prompt format, fine-tuning, parameter tuning). For more details on the approaches used on BQA
tasks, we invite the reader to refer to the survey [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>As mentioned previously, this work aims to address the issue of context reshaping in a RAG pipeline
applied to BQA. This section formalizes the problem and the method we propose to tackle it. An
overview of our pipeline is illustrated in Figure 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Question Answering</title>
<p>This task can be defined as an answer generation depending on a context built from biomedical
publications. Let q be a biomedical question expressed in natural language, and D = {d_1, d_2, ..., d_n} a
large set of biomedical publications. The system aims to generate an answer a by using an LLM and a
context C, which is extracted and potentially restructured from a subset D′ ⊂ D. D′ is obtained by
running an IR module that should maximize Recall in order to contain the sought information.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Setting the Context</title>
<p>The documents of D′ are decomposed into basic textual units, i.e., sentences. The obtained set
S = {s_1, s_2, ..., s_m} is composed of all the sentences extracted from D′. Each sentence s_i is encoded
into a vector space using an encoder. The embedding of a sentence is produced by computing
mean_pooling on the embeddings of the tokens of the sentence. To simplify notations, we will only note:
E = {enc(s_i) | s_i ∈ S},
enc(s_i) being the mean_pooling applied to the encoded tokens of s_i.</p>
        <p>To guide the LLM attention during context processing, semantically close elements are grouped
together. The intuition behind this idea is that a structured context will help the model “understand”
the information given in input, instead of having information dispatched in S. The embeddings in E
are grouped into clusters using cosine similarity, K = {k_1, k_2, ..., k_p}, where k_j denotes a cluster of
embeddings. We note T = {t_1, t_2, ..., t_p} the corresponding clusters of sentences, where each t_j is a
group of sentences of a similar “topic”:
t_j = {s_i ∈ S | ∀ e_1, e_2 ∈ k_j, cos_sim(e_1, e_2) ≥ threshold}.</p>
        <p>For each cluster t_j, a ranking algorithm is applied to identify its informative sentences t′_j ⊂ t_j:
t′_j = rank(t_j, n_s),
where rank is an implementation of a ranking method and n_s refers to the number of selected
sentences.</p>
<p>Several works show that reordering documents can affect LLMs’ performance and help them in
the context processing [23]. To create our final context C, the clusters in T′ = {t′_1, t′_2, ..., t′_p} are ranked
based on their relevance to the query q. For a cluster t′_j, a cross-encoder produces a probability p_j of
being relevant to q:
p_j = cross_encoder(q, t′_j).</p>
        <p>The most relevant clusters are then selected to build C:
C = {t′_{j_1}, t′_{j_2}, ..., t′_{j_c} | p_{j_1} ≥ p_{j_2} ≥ ... ≥ p_{j_c}},
where c &lt; p is the number of clusters to select.</p>
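<p>To make this construction concrete, the following toy sketch (ours, not the paper’s code) groups sentence vectors by cosine similarity and keeps n_s sentences per cluster. The greedy threshold pass stands in for the clustering step, and the length-based selection is a placeholder for the ranking method rank.</p>

```python
import math

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def build_context(sentences, embeddings, threshold=0.8, n_s=2):
    # Greedy clustering: attach each sentence to the first cluster whose
    # representative is close enough, otherwise start a new cluster.
    clusters = []  # list of lists of sentence indices
    for i, e in enumerate(embeddings):
        for cluster in clusters:
            if cos_sim(e, embeddings[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    # rank(t_j, n_s): placeholder that keeps the n_s longest sentences per cluster.
    context = []
    for cluster in clusters:
        picked = sorted(cluster, key=lambda i: len(sentences[i]), reverse=True)[:n_s]
        context.append([sentences[i] for i in picked])
    return context

sents = ["heart attack risk", "myocardial infarction risk factors", "aspirin dosage"]
embs = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
ctx = build_context(sents, embs)
```

<p>With these toy vectors, the two cardiology sentences land in one cluster and the aspirin sentence in another, yielding a context grouped by topic as formalized above.</p>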
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Answer Generation</title>
<p>The context C, combined with instructions I and the query q, is fed into an LLM to generate the
answer a:
a = LLM(I, C, q).</p>
        <p>This methodology aims at generating highly contextualized and relevant answers to q by leveraging
specialized documents while minimizing noise and irrelevant information.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>In this section, we present the datasets and metrics used to run our experiments. We also detail
implementation settings for the retriever, the sentence selection, the topic ranking, and the answer
generation modules.</p>
        <p>
          As shown in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], there are few datasets directly addressing the specific task we tackle. We chose to work
on the BioASQ-TaskB [
          <xref ref-type="bibr" rid="ref17">33</xref>
          ] dataset as it fits our specifications (see Section 1). BioASQ-TaskB is composed
of two phases. Phase A aims at retrieving the 10 most relevant publications for a given query from the
biomedical literature database PubMed2, and extracting their relevant snippets. Phase B focuses on answer
extraction and generation by proposing an “exact answer” and an “ideal answer”. “Exact answers”
have a particular format depending on question type (“Yes/No”, “Summary”, “Factoid”, “List”). “Ideal
answers” are natural language texts that a biomedical expert could write to answer queries. To produce
answers, participating teams are provided with the ground truth from Phase A, i.e., relevant articles and
corresponding snippets. Since BioASQ 12, Phase A+ has been introduced. Its goal is the same as that of
Phase B but without the ground truth from Phase A. All BioASQ-TaskB data are manually annotated by
biomedical experts, providing gold standards for various biomedical NLP tasks.
        </p>
        <p>We isolated queries and their corresponding “ideal answers” from BioASQ’s 11 and 12 campaigns,
which enabled us to evaluate our work on two distinct collections composed of 327 and 340 biomedical
queries.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Metrics</title>
        <p>
The BioASQ organizing team offers a manual evaluation of answers generated by participating systems. Each
annotator gives a score out of 5 for the precision, recall, readability, and repetition criteria. ROUGE2
and ROUGE-SU4 (Recall, F1) scores [
          <xref ref-type="bibr" rid="ref18">34</xref>
          ] are also provided. Manual scores are computed only while the
evaluation campaign is running. To evaluate our work and compare our models’ performance with the
methods proposed during the evaluation campaign, we have chosen to use ROUGE2 Recall, Precision,
and F1. These metrics will be referred to as R2-R, R2-P, and R2-F1 respectively. However, there is an
intrinsic limit to these lexical metrics when applied to text generation tasks. ROUGE2 evaluates the
bi-gram overlap between a reference text and a candidate response. An answer that is semantically
identical to the reference but uses synonyms will obtain a very low score despite being correct.
To evaluate our models, we have therefore also used a metric based on semantic similarity,
i.e., BERTScore [
          <xref ref-type="bibr" rid="ref19">35</xref>
          ]. On the one hand, we will be able to situate the performance of our approaches with
R2 metrics, on the other hand we will have a more accurate idea of their performance with semantic
similarities.
        </p>
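<p>The ROUGE-2 scores used here reduce to a bigram-overlap computation, which can be sketched as follows (a minimal illustration of the metric, not the official implementation, and without its stemming or tokenization details):</p>

```python
def bigrams(tokens):
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def rouge2(reference, candidate):
    """Return (recall, precision, f1) of bigram overlap, as in ROUGE-2."""
    ref_bg = bigrams(reference.lower().split())
    cand_bg = bigrams(candidate.lower().split())
    # Clipped overlap: each reference bigram can be matched at most once.
    overlap = 0
    remaining = list(ref_bg)
    for bg in cand_bg:
        if bg in remaining:
            overlap += 1
            remaining.remove(bg)
    recall = overlap / len(ref_bg) if ref_bg else 0.0
    precision = overlap / len(cand_bg) if cand_bg else 0.0
    f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return recall, precision, f1

r, p, f1 = rouge2("aspirin reduces the risk of stroke",
                  "aspirin reduces the risk of heart attack")
```

<p>The example also exhibits the limitation discussed above: a candidate that paraphrased the reference with synonyms would share almost no bigrams and score near zero despite being correct.</p>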
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Retriever</title>
        <p>
          We built a sparse retriever using Pyserini, an open-source Python library derived from Anserini which
integrates multiple IR techniques. First, we indexed all MEDLINE citations except those for which the
abstract was unavailable (≈ 25M citations). For each query, we created a list of the thousand most
relevant articles to answer it. This follows the observations of [
          <xref ref-type="bibr" rid="ref20">36</xref>
          ], showing that this architecture
achieves a Recall@1000 greater than 90%. Considering the savings in resources and computing time,
we assert that this solution is suitable enough.
        </p>
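<p>The Recall@1000 figure motivating this choice can be computed with a simple helper (our illustration of the metric, not code from the cited work):</p>

```python
def recall_at_k(retrieved_ids, relevant_ids, k=1000):
    """Fraction of the relevant documents found in the top-k retrieved list."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Toy example: 2 of the 3 relevant documents appear in the top-4 results.
r = recall_at_k(["d1", "d7", "d3", "d9"], {"d3", "d9", "d5"}, k=4)
```
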
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Sentence selection</title>
        <p>
          The retriever makes it possible to find the publications that could include the context needed to respond
to the query. The second step is to select the right information among all publications. We decided to
work at the sentence level to incorporate knowledge related to the query. We chose to compute an
embedding of each sentence using an encoder-based model to enable a semantic comparison between
them. We used the SentenceTransformer [
          <xref ref-type="bibr" rid="ref21">37</xref>
          ] library along with BioLinkBERT-large [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] to produce
these embeddings. BioLinkBERT is a version of LinkBERT pre-trained on Biomedical corpora and
achieving the best overall performance on the BLURB benchmark [
          <xref ref-type="bibr" rid="ref22">38</xref>
          ]. SentenceTransformer computes
a sentence embedding by applying a mean pooling on the embeddings of the tokens composing this
sentence.
        </p>
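<p>Mean pooling itself reduces to averaging the token vectors into a single sentence vector, as in this toy sketch (ours; the actual SentenceTransformer pooling additionally masks padding tokens):</p>

```python
def mean_pooling(token_embeddings):
    """Average a list of token vectors into one sentence vector."""
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(vec[d] for vec in token_embeddings) / n for d in range(dim)]

# Two 2-dimensional token embeddings pooled into one sentence embedding.
sentence_vec = mean_pooling([[1.0, 2.0], [3.0, 4.0]])
```
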
        <p>We decided to group sentences by topic using a clustering method on the sentence embeddings.
Since there were several thousand sentences to compare, we used the community_detection algorithm
implemented by SentenceTransformer as it is designed to handle a large number of sentences. It
computes the cosine similarity between embeddings to determine groups and incorporates several
optimisations to manage large collections.</p>
<p>After semantically grouping together the sentences, it is necessary to identify which sentences of each
topic would compose the context. We implemented the TextRank algorithm and applied it to each
cluster to identify their salient sentences (i.e., 4, 10, or 15 sentences per topic).</p>
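<p>A minimal TextRank over a sentence-similarity graph can be sketched as follows (our sketch; the similarity matrix and damping factor are illustrative assumptions, not the parameters used in our experiments):</p>

```python
def textrank(similarity, damping=0.85, iters=50):
    """Power iteration over a dense similarity matrix; one score per sentence."""
    n = len(similarity)
    scores = [1.0 / n] * n
    # Each node distributes its score proportionally to its outgoing weights.
    out_sums = [sum(row) for row in similarity]
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = sum(
                scores[j] * similarity[j][i] / out_sums[j]
                for j in range(n) if out_sums[j] > 0 and j != i
            )
            new.append((1 - damping) / n + damping * rank)
        scores = new
    return scores

# Toy graph: sentence 0 is similar to both others, which barely connect.
sim = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
scores = textrank(sim)
top = max(range(3), key=lambda i: scores[i])
```

<p>The most central sentence of the cluster gets the highest score; selecting the top-n scored sentences per cluster yields the 4, 10, or 15 salient sentences per topic.</p>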
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Ranker</title>
        <p>
In order to have the most precise context possible, it could be beneficial to choose the topics and,
if necessary, discard irrelevant ones for the given query. Furthermore, several studies have shown that
the organization of the context can impact LLMs’ performance in question answering tasks. Following
our previous work on document ranking in a multi-stage retrieval pipeline [
          <xref ref-type="bibr" rid="ref23">39</xref>
          ], we draw an analogy
between scientific publication ranking and topic ranking tasks. The former aims to rank documents
by order of relevance to a query. We showed that changing the granularity of such documents and
selecting relevant sentences among them instead of considering the whole document is beneficial. The
topic ranking is globally the same task, differing only by the fact that we work on clusters composed of
semantically close sentences. We applied a BioLinkBERT cross-encoder fine-tuned on the BioASQ-TaskB
dataset. This model computes a probability of relevance used to rank the topics. Once the ranked list of
topics has been generated, we chose a fixed number to be used as context for the queries (i.e., the first 5,
10, or 15 topics depending on the experiments).
        </p>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Answer Generation</title>
        <p>We have seen in the previous sections how to establish a context for answering biomedical questions.
To provide an answer based on a question and its context, we built an answer generation tool relying
on LLaMA. We used the third release of this model in its 8B parameters version3, called llama3.1-8B
in the remainder of this paper. This model is open-weight and achieved SOTA results compared to
LLMs of the same scale. Its architecture is optimized for high performance on Q&amp;A tasks. Moreover, its
powerful tokenizer makes it possible to process a large number of input tokens, which is very important for ICL.
We also chose this model to reduce computational costs and energy consumption compared to LLMs
with higher number of parameters. To further reduce costs, we applied model quantization and used
4-bit precision for floating-point representation instead of 32-bit.</p>
<p>The prompts we used to parameterize the model are reported in Figures 2, 3, and 4. We also ran an
experiment without context to evaluate the benefit of adding context. In these formulations, we first
give a role to the system, then we explain the input that we give to the system, and finally we specify
the task. We did not consider any prompt engineering optimization.</p>
        <p>Prompt with context:
System Prompt
You are a biomedical expert providing answers.
I will give a question and several context texts about the question. Based on the context, give
a short answer to the question.
QUESTION: *A biomedical query*
CONTEXTS: *Sentences extracted from biomedical literature*
ANSWER:</p>
        <p>Prompt without context:
System Prompt
You are a biomedical expert providing answers.
I will ask a question and your role is to give a short answer to the question.</p>
        <p>Prompt with context and ICL examples:
System Prompt
You are a biomedical expert providing answers.
I will give a question and several context texts about the question. Based on the context, give
a short answer to the question. Moreover, I will give you 3 questions and their corresponding
answers as examples.
User Prompt
EXAMPLES: *A set of 3 questions/answers dependent on the question type*
QUESTION: *A biomedical query*
CONTEXTS: *Sentences extracted from biomedical literature*
ANSWER:</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
<p>In this section, we present the experimental results obtained by applying our approach on the two BQA
datasets described in Section 4.1. We evaluated its performance by studying the impact of context
incorporation and its reshaping. Then, we tested the effect of ICL by adding examples of (query, answer)
pairs.</p>
      <sec id="sec-5-1">
        <title>5.1. Influence of context texts</title>
<p>The aim of these first experiments is to show the effect of different types of context reshaping. We
evaluate whether a small amount of context text is enough for the LLM to perform well or if each piece
of information needs to be repeated in order to be taken into account.</p>
        <p>We developed three variants of the model to establish baselines. First, we generated answers using
llama3.1-8B without incorporating any context. Next, we used llama3.1-8B on the same dataset but
incorporating context by selecting 4 sentences per cluster and without applying any topic ranking.
Finally, we extracted what we call the “Exact Context”, which corresponds to the relevant snippets
provided in BioASQ dataset. In a real-world scenario, such information is not available and this variant
enables us to estimate the maximum scores achievable with this model configuration.</p>
        <p>The scores obtained by these three variants on the BioASQ11 dataset are reported in Table 1. We
observe that the basic system (without context) performs relatively well in terms of Recall but is very
weak when it comes to Precision, whether semantic (BERT-P) or lexical (R2-P). The incorporation
of context without ordering is undeniably beneficial, as it improves the basic system performance.
However, we note that the highest improvements are primarily observed in the semantic metrics,
indicating that while the LLM can leverage the context, it is not fully aligned with the vocabulary used
by the annotators.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Influence of context reshaping</title>
        <p>We studied the impact of organizing the context. To do so, we added the cluster-ranking module to
our previous experiments and generated answers while varying the parameters that define the context
format, i.e., the number of clusters selected and the number of sentences per cluster. Since each cluster
is associated with a topic present in the corpus, the objective here is to determine the required context
size for generating accurate responses and to evaluate whether the LLM needs repeated information to
effectively process it.</p>
        <p>Table 2 presents the scores obtained for these experiments. First, we observe that, with 4 sentences
per topic as in the previous experiments, selecting any number of clusters tends to decrease both lexical
and semantic Recall scores. This was expected as we intentionally limited the amount of information
retrieved. This loss is outweighed by the gain in Precision when selecting 10 clusters, as evidenced
by the improvements in F1 scores. It appears that using too few or too many topics decreases the
performance. We miss part of the information with few clusters, but adding too many introduces noise.
This observation is in line with the fact that the ranking model is optimized to return a list of 10 relevant
documents.</p>
<p>Afterwards, we studied the effect of increasing the number of sentences in each topic with fixed
numbers of clusters. We generated answers with 5 or 10 clusters and for each configuration ran the
experiment with 4, 10, and 15 sentences. We observe that each setup achieves higher scores as the
number of sentences increases. In this case, we push less relevant topics further away from the query
(regarding token distance in the sequence) without removing them. As a result, we give more weight to
the relevant topics. Therefore, it seems wise to help the LLM focus its attention on the more relevant
information without deleting less relevant ones.</p>
<p>Note that the best scores for this set of experiments are achieved when using the parameters leading
to the highest scores for each study (e.g., 15 sentences per cluster and 10 clusters). Moreover, we ran
a t-test between this variant and the results obtained by the baseline labeled “Unranked Topics” in
the previous section. The obtained p-values were lower than 0.05 on all metrics, meaning that all the
improvements are significant.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Influence of ICL</title>
        <p>We decided to complete the incorporation of structured context by combining it with ICL. For each
question type, we randomly extracted 3 examples of (question, answer) pairs from the BioASQ10 dataset.
ICL is expected to help the LLM better understand how to structure its responses and potentially
align itself with the vocabulary used by the annotators. Tables 3 and 4 show the scores obtained on
the BioASQ11 and BioASQ12 datasets, respectively. We conducted experiments by varying the same
parameters as in the previous section and using the prompt shown in Figure 4.</p>
        <p>First, we observe that, for the same parameters, adding ICL consistently decreases performance on
the two Recall measures: on average -5.17% on R2-R and -0.92% on BERT-R. This slight loss is more
than compensated by higher gains in both Precision and F1 scores: on average +22.34% on R2-P and
+4.32% on BERT-P. Table 5 shows the mean number of tokens in ground truth and in answers generated
with 10 sentences and 10 clusters. Generated answers are much longer than the gold standard, and ICL
tends to reduce answer length. This leads to retrieving slightly less information, but at the same time,
the information returned is much more accurate. This phenomenon is observable on both datasets.
The scores on lexical metrics are lower on the BioASQ12 set. This can be explained by the fact that a new
annotator was involved in its creation. Consequently, the model has no prior insight into the vocabulary
used by this annotator. We ran a t-test between the best variant in Table 3 and our “Unranked Topics”
baseline. We found that the p-value associated with BERT-R was higher than 0.05, meaning the loss is
insignificant. All other p-values were lower than 0.05. The gains on Precision metrics are significant,
but so is the loss on R2-R.</p>
<p>We compared our results (10 sentences and 10 clusters in Table 4) with other systems submitted
in Phase A+ of the BioASQ12 challenge4. The best submissions in terms of R2-R (32.01 to 38.68
depending on the batch) have significantly lower Precision (R2-F1 ranging from 12.44 to 19.23) than
our system (an average of 19.67 over the 4 batches). This indicates that our Recall-Precision trade-off
is better. Moreover, the systems achieving the highest R2-F1 scores (25.03 to 28.62) exhibit a better
trade-off but their corresponding Recall scores (R2-R ranging from 22.62 to 27.23) are lower than those
achieved by our system (average of 28.60). Considering that the top-performing runs used models with
many more parameters (e.g., GPT-3.5, GPT-3.5 Turbo, GPT-4), employed fine-tuning techniques, and
possibly leveraged metric-specific tuning (e.g., generated longer answers to obtain a better Recall, used
a translation module to optimize bi-gram overlap), we can conclude that our approach is both relevant
and effective.
4https://participants-area.bioasq.org/results/12b/phaseAplus/</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
<p>In this article, we presented several approaches to incorporate biomedical knowledge into an LLM. We
showed that answers generated with contextualized prompts improve Precision more than Recall.
Moreover, improvements on semantic metrics are larger than on lexical ones, meaning
the generated answers are not easily aligned with a given vocabulary.</p>
      <p>Ranking clusters enhances the scores under specific conditions. It is essential to select enough clusters
to capture relevant information, but retrieving too many can introduce noise and degrade performance.
In addition, it seems beneficial to increase the number of sentences per cluster: this helps the LLM
focus its attention on relevant information by pushing less relevant information away without
deleting it. Finally, we showed that integrating ICL into a RAG pipeline, despite a slight loss on Recall,
enables major improvements in terms of Precision and F1 scores. The comparison with some of the
top-performing models from the BioASQ challenge shows that our approach achieves competitive
results.</p>
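<p>The context-construction strategy summarized above (rank the clusters, keep the top ones, and take several sentences per cluster) can be sketched as follows; the data layout and scores are illustrative assumptions, not the exact pipeline:</p>

```python
def build_context(clusters, n_clusters, n_sentences):
    # clusters: list of (cluster_score, [(sentence_score, sentence), ...])
    # keep the top-ranked clusters, then the top-scored sentences inside each
    top = sorted(clusters, key=lambda c: c[0], reverse=True)[:n_clusters]
    parts = []
    for _, sentences in top:
        best = sorted(sentences, key=lambda s: s[0], reverse=True)[:n_sentences]
        parts.extend(sent for _, sent in best)
    return " ".join(parts)
```

<p>With more sentences per cluster, marginally relevant sentences remain in the prompt, only placed further from the question, instead of being deleted.</p>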
      <p>
        Future work will be dedicated to studying alternative ways to incorporate context for answer generation,
e.g., using biomedical knowledge bases to structure knowledge as semantic predications
(subject-predicate-object triples) [
        <xref ref-type="bibr" rid="ref24 ref25">40, 41</xref>
        ]. Further investigation into optimizing context selection could improve
both answer quality and readability in real-world biomedical applications. Finally, it would be wise to
integrate citations directly into the answers so that readers can easily validate the generated information.
      </p>
      <p>Our work aligns with the challenge of efficiently extracting and structuring biomedical knowledge
from vast amounts of scientific literature. In fields like metabolomics, where researchers must analyze
large numbers of publications to interpret metabolic signatures, automated methods could significantly
assist knowledge retrieval. Existing tools like FORUM facilitate bibliographic exploration by linking
metabolites to biomedical concepts, but they remain limited in handling large-scale textual data. By
refining context selection and integrating structured knowledge representations, our approach could
help improve literature-based discovery in metabolomics and beyond.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Generative AI tools for grammar and spelling
checking, text translation, and improving the writing style. After using these tools, the authors
reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
      <p>[11] Q. Dong, Y. Liu, S. Cheng, S. Wang, Z. Cheng, S. Niu, D. Yin, Incorporating explicit knowledge in pre-trained language models for passage re-ranking, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 1490–1501. doi:10.1145/3477495.3531997.
[12] J. Tan, J. Hu, S. Dong, Incorporating entity-level knowledge in pretrained language model for biomedical dense retrieval, Computers in Biology and Medicine 166 (2023) 107535.
[13] Q. Xie, P. Tiwari, S. Ananiadou, Knowledge-enhanced graph topic transformer for explainable biomedical text summarization, IEEE Journal of Biomedical and Health Informatics 28 (2024) 1836–1847. doi:10.1109/JBHI.2023.3308064.
[14] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.
[15] Q. Dong, L. Li, D. Dai, C. Zheng, J. Ma, R. Li, H. Xia, J. Xu, Z. Wu, B. Chang, X. Sun, L. Li, Z. Sui, A survey on in-context learning, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 1107–1128. doi:10.18653/v1/2024.emnlp-main.64.
[16] J. Guo, Y. Cai, Y. Fan, F. Sun, R. Zhang, X. Cheng, Semantic models for the first-stage retrieval: A comprehensive review, ACM Trans. Inf. Syst. 40 (2022). doi:10.1145/3486250.
[17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[18] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are unsupervised multitask learners, OpenAI blog 1 (2019) 9.
[19] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[20] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The Llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024).
[21] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, et al., Mixtral of experts, arXiv preprint arXiv:2401.04088 (2024).
[22] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Computing Surveys 55 (2023) 1–38.
[23] F. Cuconasu, G. Trappolini, F. Siciliano, S. Filice, C. Campagnano, Y. Maarek, N. Tonellotto, F. Silvestri, The power of noise: Redefining retrieval for RAG systems, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 719–729. doi:10.1145/3626772.3657834.
[24] O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-Brown, Y. Shoham, In-context retrieval-augmented language models, Transactions of the Association for Computational Linguistics 11 (2023) 1316–1331.
[25] O. Ayala, P. Bechard, Reducing hallucination in structured outputs via retrieval-augmented generation, in: Y. Yang, A. Davani, A. Sil, A. Kumar (Eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 6: Industry Track), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 228–238. doi:10.18653/v1/2024.naacl-industry.19.
[26] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger, N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
repository of biomedical semantic predications, Bioinformatics 28 (2012) 3158–3160.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <article-title>Improving language understanding by generative pre-training</article-title>
          ,
          <year>2018</year>
          . URL: https://api.semanticscholar.org/CorpusID:49313245.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Pretrained transformers for text ranking: BERT and beyond</article-title>
          , in: G. Kondrak,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bontcheva</surname>
          </string-name>
          , D. Gillick (Eds.),
          <source>Proceedings of the</source>
          <year>2021</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials, Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .naacl-tutorials.1/. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .naacl-tutorials.
          <volume>1</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>C.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dou</surname>
          </string-name>
          , J. rong Wen,
          <article-title>Large language models for information retrieval: A survey</article-title>
          ,
          <source>ArXiv abs/2308</source>
          .07107 (
          <year>2023</year>
          ). URL: https://api.semanticscholar.org/CorpusID:260887838.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xiang</surname>
          </string-name>
          , et al.,
          <article-title>Scientific large language models: A survey on biological &amp; chemical domains</article-title>
          ,
          <source>ACM Computing Surveys</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fergadiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Aletras</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Androutsopoulos</surname>
          </string-name>
          , LEGAL-BERT:
          <article-title>The muppets straight out of law school</article-title>
          , in: T. Cohn,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          , Y. Liu (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2020</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>2898</fpage>
          -
          <lpage>2904</lpage>
          . URL: https://aclanthology.org/
          <year>2020</year>
          .findings-emnlp.
          <volume>261</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .findings-emnlp.
          <volume>261</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Biomedical question answering: a survey of approaches and challenges</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 55</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          , H. Cheng, Y. Gu,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <article-title>Fine-tuning large neural language models for biomedical natural language processing</article-title>
          ,
          <source>CoRR abs/2112</source>
          .07869 (
          <year>2021</year>
          ). URL: https://arxiv.org/abs/2112.07869. arXiv:
          <volume>2112</volume>
          .
          <fpage>07869</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yasunaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Linkbert: Pretraining language models with document links</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <volume>2203</volume>
          .
          <fpage>15827</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>r. Kanakarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kundumani</surname>
          </string-name>
          , M. Sankarasubbu,
          <article-title>BioELECTRA:pretrained biomedical text encoder using discriminators</article-title>
          , in: D.
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>K. B.</given-names>
          </string-name>
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Ananiadou</surname>
          </string-name>
          , J. Tsujii (Eds.),
          <source>Proceedings of the 20th Workshop on Biomedical Language Processing</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>154</lpage>
          . URL: https://aclanthology.org/
          <year>2021</year>
          .bionlp-
          <volume>1</volume>
          . 16. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .bionlp-
          <volume>1</volume>
          .16.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bougiatiotis</surname>
          </string-name>
          , G. Paliouras,
          <string-name>
            <surname>BioASQ-QA</surname>
          </string-name>
          :
          <article-title>A manually curated corpus for Biomedical Question Answering</article-title>
          ,
          <source>Scientific Data</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>170</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ateia</surname>
          </string-name>
          , U. Kruschwitz,
          <article-title>Can open-source llms compete with commercial models? exploring the few-shot performance of current GPT models in biomedical tasks</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2024</year>
          ), Grenoble, France,
          <fpage>9</fpage>
          -
          <issue>12</issue>
          <year>September</year>
          ,
          <year>2024</year>
          , volume
          <volume>3740</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>78</fpage>
          -
          <lpage>98</lpage>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3740</volume>
          /paper-07.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Enhancing biomedical question answering with parameter-efficient finetuning and hierarchical retrieval augmented generation</article-title>
          ,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>117</fpage>
          -
          <lpage>129</lpage>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3740</volume>
          /paper-10.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Merker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Viehweger</surname>
          </string-name>
          , Mibi at bioasq 2024:
          <article-title>retrieval-augmented generation for answering biomedical questions</article-title>
          ,
          <source>in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2024</year>
          ), Grenoble, France, volume
          <volume>3740</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>176</fpage>
          -
          <lpage>187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>T.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Jonker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Reis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Matos</surname>
          </string-name>
          , Bit.ua at BioASQ 12:
          <article-title>From retrieval to answer generation</article-title>
          ,
          <source>CLEF Working Notes</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Panou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Dimopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reczko</surname>
          </string-name>
          ,
          <article-title>Farming open llms for biomedical question answering</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscáková</surname>
          </string-name>
          , A. G. S. de Herrera (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2024</year>
          ), Grenoble, France,
          <fpage>9</fpage>
          -
          <issue>12</issue>
          <year>September</year>
          ,
          <year>2024</year>
          , volume
          <volume>3740</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>188</fpage>
          -
          <lpage>196</lpage>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3740</volume>
          /paper-17.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Paliouras, Overview of BioASQ Tasks 12b and Synergy12 in CLEF2024</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . García Seco de Herrera (Eds.),
          <source>Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [34]
          <string-name>
            <surname>C.-Y. Lin</surname>
            ,
            <given-names>ROUGE:</given-names>
          </string-name>
          <article-title>A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics</article-title>
          , Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013/.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          , Bertscore:
          <article-title>Evaluating text generation with bert</article-title>
          , arXiv preprint arXiv:
          <year>1904</year>
          .
          <volume>09675</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>T.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A. A.</given-names>
            <surname>Jonker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Poudel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Matos</surname>
          </string-name>
          , Bit.ua at BioASQ 11b:
          <article-title>Two-stage ir with synthetic training and zero-shot answer generation</article-title>
          .,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          , in:
          <string-name>
            <given-names>K.</given-names>
            <surname>Inui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wan</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , Association for Computational Linguistics, Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3982</fpage>
          -
          <lpage>3992</lpage>
          . URL: https://aclanthology.org/D19-1410/. doi:10.18653/v1/D19-1410.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lucas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <article-title>Domain-specific language model pretraining for biomedical natural language processing</article-title>
          ,
          <source>ACM Trans. Comput. Healthcare</source>
          <volume>3</volume>
          (
          <year>2021</year>
          ). URL: https://doi.org/10.1145/3458754. doi:10.1145/3458754.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lesavourey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hubert</surname>
          </string-name>
          ,
          <article-title>Enhancing Biomedical Document Ranking with Domain Knowledge Incorporation in a Multi-Stage Retrieval Approach</article-title>
          .,
          <source>in: 12th BioASQ Workshop at CLEF 2024</source>
          , volume
          <volume>3740</volume>
          , Grenoble, France,
          <year>2024</year>
          . URL: https://hal.science/hal-04744454.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>G.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kumarage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Alghamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Can knowledge graphs reduce hallucinations in LLMs?: A survey</article-title>
          , in:
          <string-name>
            <given-names>K.</given-names>
            <surname>Duh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>3947</fpage>
          -
          <lpage>3960</lpage>
          . URL: https://aclanthology.org/2024.naacl-long.219/. doi:10.18653/v1/2024.naacl-long.219.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kilicoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fiszman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rosemblat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Rindflesch</surname>
          </string-name>
          ,
          <article-title>SemMedDB: a PubMed-scale</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>