<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SINAI at BioASQ@CLEF 2025: A Multi-Stage RAG Pipeline for Biomedical Semantic Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sara Dueñas-Romero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>L. Alfonso Ureña-López</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eugenio Martínez-Cámara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SINAI - Research Group. Center for Advanced Studies in ICT (CEATIC). Universidad de Jaén</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This working note presents a multi-stage Retrieval-Augmented Generation (RAG) pipeline for the challenging BioASQ Task 13 Synergy, which focuses on biomedical semantic Question Answering with interactive expert feedback and no dedicated training data. Our system employs question analysis with Named Entity Recognition (NER) for query enhancement, dynamic FAISS indexing with S-PubMedBert for relevant document and snippet retrieval, and a biomedical fine-tuned Llama-based LLM with few-shot prompting for generating exact and 'ideal' answers. Evaluations in Round 4 highlighted our top performance in snippet retrieval. While document retrieval improved with updated PubMed data, exact answer performance varied by question type. For 'ideal' answers, manual expert evaluation favored our optimized system despite automatic metrics suggesting otherwise. Error analysis revealed areas for future improvement, including inference strategies and answer granularity. Overall, our results demonstrate the efectiveness of combining updated retrieval, entity-driven queries, and tuned LLM prompting for biomedical QA in this interactive setting.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;natural language processing</kwd>
        <kwd>retrieval augmented generation</kwd>
        <kwd>question answering</kwd>
        <kwd>large language models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The Conference and Labs of the Evaluation Forum (CLEF) 2025 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] continues its established tradition
(since 2010) of fostering advancements in multilingual and multimodal information access through a
peerreviewed conference and a series of rigorous evaluation labs and workshops. Within this framework,
the BioASQ Task Synergy [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] emerges as a significant and timely challenge within biomedical semantic
Question Answering (QA), specifically targeting the retrieval and synthesis of information concerning
evolving issues in biomedicine.
      </p>
      <p>
        Distinct from conventional QA tasks that heavily rely on static training data, BioASQ Synergy
13 adopts a novel, interactive approach centered on incremental expert feedback provided across
four rounds of evaluation. Participants are tasked with developing systems capable of answering a
diverse set of English biomedical questions (yes/no, factoid, list, and summary) by retrieving relevant
PubMed articles, extracting pertinent snippets, and generating both exact and summary ’ideal’ answers.
Notably, this iteration of Synergy does not provide a dedicated training set, making the expert feedback
mechanism crucial for system renfiement. The test dataset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] comprises 74 questions with a distribution
of 31% yes/no, 26% list, 24% summary, and 19% factoid questions.
      </p>
      <p>To address this challenging task, we propose a system grounded in a Retrieval-Augmented Generation
(RAG) architecture. Our approach begins with a preliminary question analysis that developed into an
entity extraction methodology for query enhancement, which we utilize to formulate queries for the
PubMed API to eficiently retrieve relevant document identifiers, thereby overcoming the data volume
challenge.</p>
      <p>Subsequently, we employ a reranking strategy with semantic searching for efective document and
snippet retrieval from the title and abstract sections of the identified articles. These retrieved snippets
serve as context within carefully constructed few-shot example prompts to guide a Llama-based large
language model (LLM), fine-tuned with biomedical data, in generating the answers in the desired format.
Our submitted systems achieved top rankings on the ’snippet retrieval’ leaderboard, and demonstrated
strong performance in answer extraction, the details of which will be further discussed in the results
and error analysis sections.</p>
      <p>This paper will detail our information retrieval and indexing approach, the prompt engineering
strategy, and a comprehensive explanation of the system workflow, culminating in an analysis of the
results and identified errors.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The field of biomedical question answering (QA) has witnessed significant advancements in recent years,
fueled by progress in both information retrieval methodologies and the capabilities of large language
models (LLMs). While early BioASQ challenges primarily focused on static datasets and traditional
retrieval pipelines, the Synergy series, including previous iterations; like in CLEF 2024 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], introduces a
dynamic and interactive evaluation paradigm driven by expert feedback.
      </p>
      <p>Our work builds upon two crucial and interconnected strands of research: Retrieval-Augmented
Generation (RAG) architectures for knowledge-intensive natural language processing (NLP) tasks, and
the adaptation and fine-tuning of LLMs for enhanced performance within the biomedical domain.</p>
      <sec id="sec-2-1">
        <title>2.1. Large Language Models in Biomedical QA</title>
        <p>The recent surge in the capabilities of Large Language Models (LLMs) has profoundly impacted various
NLP tasks, and biomedical QA is no exception. These models, with their ability to understand and
generate human-like text, ofer unprecedented opportunities for extracting and synthesizing information
from complex biomedical literature. A key direction in this field involves fine-tuning general-purpose
LLMs on extensive biomedical corpora, such as PubMed abstracts, clinical notes, and other specialized
texts. This domain-specific adaptation significantly enhances the models’ understanding of medical
terminology, biological processes, and clinical contexts, leading to improved performance on biomedical
QA benchmarks.</p>
        <p>
          The Llama 3 "Herd" [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] exemplifies the advancements in open-source LLMs, showcasing a range of
model sizes optimized for diferent computational constraints while achieving state-of-the-art results
on both general and specialized evaluations. Similarly, the "Aloe" series of healthcare-focused LLMs [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ],
trained on rich clinical and research data, has demonstrated superior reasoning abilities in critical areas
like disease understanding and drug interactions, outperforming general-purpose models.
Furthermore, the development of models like MEDITRON-70B [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which underwent large-scale pre-training
specifically on medical data, underscores the benefits of domain-focused training for achieving high
performance in biomedical NLP tasks. These specialized LLMs form a crucial component in modern
biomedical QA systems, providing the generative power necessary to produce comprehensive and
accurate answers.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Retrieval-Augmented Generation</title>
        <p>Retrieval-Augmented Generation (RAG) has emerged as a highly efective paradigm for tackling
knowledge-intensive NLP tasks, particularly in domains characterized by vast and rapidly evolving
information, such as biomedicine. The core idea behind RAG is to augment the knowledge of a
generative model by explicitly retrieving relevant information from an external knowledge source before
generating the final output. This approach contrasts with relying solely on the parametric knowledge
encoded within the LLM’s weights, which can be limited or outdated, especially when dealing with
emerging topics as in the BioASQ Synergy challenge.</p>
        <p>
          Lewis et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] first formalized this framework for open-domain QA, demonstrating significant
improvements in answer accuracy and factuality by conditioning the language model on retrieved
relevant passages. Subsequent research has explored and extended the RAG paradigm for various
knowledge-intensive applications [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], highlighting its versatility and efectiveness in scenarios where
access to up-to-date and specific information is crucial. In the biomedical domain, RAG ofers a
compelling solution for navigating the extensive PubMed literature, enabling systems to dynamically
retrieve relevant abstracts and full-text articles in response to complex biomedical questions. This is
particularly relevant for the BioASQ Synergy task, where the questions often pertain to developing
issues, necessitating the integration of the latest research findings [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          Prior work within the BioASQ challenges, including systems described in CLEF 2024 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], has
successfully employed RAG architectures, further validating its suitability and efectiveness for biomedical
question answering. By combining the retrieval capabilities of information retrieval systems with the
generative power of LLMs, RAG ofers a promising avenue for building robust and accurate biomedical
QA systems capable of addressing the dynamic nature of biomedical knowledge.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed system</title>
      <p>To efectively tackle the multifaceted challenges presented by the BioASQ Task Synergy, particularly the
need for dynamic information retrieval and the generation of diverse answer formats based on evolving
biomedical knowledge, we propose a novel multi-stage pipeline anchored in a Retrieval-Augmented
Generation (RAG) architecture.</p>
      <p>As visually depicted in Figure 1, our system strategically integrates three key modules: data
preparation, relevant information retrieval, and response generation. The data preparation stage focuses on
the initial processing of the incoming biomedical questions. The subsequent IR module is designed for
the intelligent retrieval of relevant information from the vast PubMed repository.</p>
      <p>Finally, the response generation module leverages this retrieved context, guided by carefully
engineered prompts, to produce comprehensive answers tailored to the specific question type (summary,
yes/no, factoid, list) and adhering to the task’s requirements for article references, snippets, and ideal
answers. Our approach leverages state-of-the-art Natural Language Processing (NLP) models and
techniques within each stage to ensure both the accuracy and the relevance of the generated responses
in this demanding interactive biomedical QA task.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Preparation</title>
        <p>The initial phase focuses on acquiring and structuring the biomedical literature corpus required for the
task.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Data Download</title>
          <p>The system utilizes publicly available PubMed data. Specifically, the process involves downloading the
PubMed baseline corpus and subsequent update files up to the date specified for the relevant challenge
round. These files are initially obtained in xml.gz format from an FTP server 1. Following download, the
XML data is parsed and converted into a more manageable Comma-Separated Values (CSV) format.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Data Format</title>
          <p>To optimize storage and processing eficiency, the raw text data from the CSV files is transformed
into the Apache Parquet format. This columnar storage format significantly reduces storage footprint
and improves read performance. The corpus is further processed by segmenting the text (primarily
titles and abstracts as per task constraints) into overlapping chunks or snippets, as shown in Figure
2. A maximum chunk size of 512 characters is enforced, with an overlap of 64 characters between
consecutive snippets. This overlap helps preserve contextual information across snippet boundaries.</p>
          <p>Due to the large volume of data, this chunking process is performed in batches. For each batch, text
content is extracted from the CSV rows and segmented using a text splitter. Each resulting snippet is
stored as a structured record containing the text itself, a unique iterative snippet identifier (ID), the
source document’s PubMed Identifier (PMID), the section of the document it originated from (e.g.,
TITLE, ABSTRACT), and the start and end character positions of the snippet within its section.</p>
          <p>To handle the large scale of the data eficiently without loading entire datasets into memory, processing
relies on Polars LazyFrames. This allows for defining computation graphs on the data that are executed
only when results are materialized, minimizing memory usage. Each processed chunk is represented
as a Document object, encapsulating its text content and associated metadata (ID, PMID, SECTION,
START, END).</p>
          <p>Finally, a mapping file is generated. This file acts as an index, storing the relationship between PMIDs
and the Parquet files containing their corresponding pre-computed snippets. It records the start and end
PMID for the documents primarily contained within each Parquet file, facilitating eficient lookup of
relevant files based on a list of PMIDs. Notably, update files may contain documents with non-sequential
PMIDs. This prepared, chunked, and indexed corpus forms the basis for subsequent retrieval steps.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Relevant Information Retrieval Module</title>
        <p>This module is responsible for identifying and retrieving relevant textual evidence from the prepared
corpus in response to a given biomedical question. It involves two main stages: retrieving relevant
1https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/, https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/
document identifiers (PMIDs) and then extracting specific, semantically relevant text snippets from
those documents.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Document Retrieval</title>
          <p>This first step aims to identify a candidate set of PubMed articles likely to contain information relevant
to the input question.</p>
          <p>Question and Query Analysis. Initial experiments revealed that querying the PubMed API directly
with the full natural language question often resulted in insuficient or irrelevant document retrieval.
The presence of common words, stop words, and specific phrasing nuances can dilute the query’s focus
when using keyword-based or simple search mechanisms likely employed by the PubMed API. To
address this, a query enhancement strategy based on Named Entity Recognition (NER) is employed.</p>
          <p>
            The system utilizes the d4data/biomedical-ner-all model [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] , which is built upon
distilbert-baseuncased and specifically fine-tuned on the Maccrobat dataset. This dataset consists of clinical case
reports from PubMed Central, making the model particularly well-suited for identifying biomedical
entities within the target domain. The model is configured to extract entities primarily of type ’chemical’
and ’disease’, as these are often the core concepts in biomedical questions.
          </p>
          <p>As the employed DistilBERT-based NER model often outputs subword-tokenized entities, a crucial
aggregation step is implemented. This custom logic reconstructs full entity names by combining
consecutive subword tokens that share the same entity label (e.g., “trypa” and “##nosomiases” become
“trypanosomiases”).</p>
          <p>Simultaneously, the entity type is maintained, and a representative confidence score is computed
as the average of the subword token scores. This aggregation process is vital for transforming the
fragmented NER output into coherent and precise query terms, thereby significantly improving the
relevance of the retrieved documents from PubMed.</p>
          <p>PubMed API for PMID retrieval. The PubMed API2 is then queried using two distinct approaches
for each input question: once with the original, full question text, and separately with the aggregated
list of extracted ’chemical’ and ’disease’ entities. The primary goal of these queries is to retrieve the
PubMed Identifiers (PMIDs) of potentially relevant articles. While the API might return additional
information, such as full text excerpts, only the retrieved PMIDs are utilized in the subsequent stages of
the pipeline, adhering to the task constraints of using only pre-processed title and abstract data.</p>
          <p>The lists of PMIDs obtained from querying the original question and the extracted entities are then
combined. Duplicate PMIDs are removed to generate a final, unique set of candidate PMIDs associated
with the input question. This list forms the input for the snippet extraction phase.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Snippet extraction</title>
          <p>Given the list of relevant PMIDs, the next goal is to identify and rank the specific text snippets (chunks)
from the corresponding documents that are most semantically relevant to the input question. These
snippets serve as the direct evidence for the final answer generation module.</p>
          <p>The process begins by retrieving all pre-computed snippets associated with the identified PMIDs.
This is handled by a function that takes the list of PMIDs and the path(s) to the relevant Parquet file(s)
(located using the previously-generated distribution mapping) as input. It reads the specified files and
iflters the data to select only the rows whose PMID matches one of the input PMIDs, and converts these
rows into Document objects containing the chunk text and metadata (PMID, SECTION, START, END).</p>
          <p>
            Subsequently, a semantic search is performed to rank these retrieved snippets based on their relevance
to the question. This relies on a sophisticated embedding model: pritamdeka/S-PubMedBert-MS-MARCO
2https://pmc.ncbi.nlm.nih.gov/tools/developers/
[
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. This model is particularly suitable for this task as it is derived from
microsoft/BiomedNLPPubMedBERT-base-uncased-abstract-fulltext and has been specifically fine-tuned on the MS-MARCO
dataset for information retrieval tasks within the medical domain.
          </p>
          <p>It maps sentences and paragraphs into a 768-dimensional dense vector space, employing a mean
pooling strategy over token embeddings to generate the final sentence vector. Mean pooling is a
common and efective technique for creating robust sentence embeddings from transformer outputs,
ensuring contribution from all tokens.</p>
          <p>For eficient semantic search, a FAISS (Facebook AI Similarity Search) index is constructed. Notably, a
temporary, query-specific index is built in memory for each incoming question, rather than maintaining
a single, persistent index for the entire corpus. This design choice implies that the number of candidate
snippets retrieved per query (based on the initial PMID list) is computationally manageable, making
the overhead of per-query index creation acceptable. It simplifies system architecture by avoiding the
complexities of maintaining and updating a massive, persistent index of all PubMed snippets.</p>
          <p>The index performs an exact search based on the maximum inner product between the query vector
and the indexed vectors. When vectors are L2-normalized (as is standard practice for cosine similarity
search with sentence transformers), maximizing the inner product is equivalent to maximizing cosine
similarity. The choice of IndexFlatIP prioritizes retrieval accuracy, guaranteeing that the truly closest
snippets within the candidate set are identified. While approximate nearest neighbor indices like
IndexHNSW ofer faster search times on very large datasets , the exact search provided by IndexFlatIP
is preferred here to ensure the highest quality evidence is passed to the generation module, especially
since the candidate pool has already been narrowed by the PMID retrieval step.</p>
          <p>The index construction process involves:
1. Identifying the relevant Parquet files containing snippets for the query’s PMIDs using the mapping</p>
          <p>CSV.
2. Extracting all corresponding Document objects.
3. Eliminating duplicate chunks to prevent redundancy in the index. Duplicates are identified using
a unique key generated from a combination of the chunk’s text content and its metadata (PMID,
section, start, end position)
4. Embedding the text content of the unique snippets using the embedding model.
5. Adding these embeddings and their corresponding Document objects to the index.
6. Once the temporary index is built, the input question is embedded using the same model.
7. A similarity search is then performed, which retrieves the top 20 most similar snippets based on
cosine similarity (achieved via normalized embeddings and the inner product search), along with
their respective similarity scores.
8. Finally, the retrieved snippets are sorted in descending order based on their similarity scores.
9. The text content of these top-ranked snippets is concatenated into a single string, separated by
newline characters, creating the summaries input for the generation module.
10. The function returns both this concatenated summaries string and the ranked list of the top</p>
          <p>Document objects including their scores.</p>
          <p>By employing a specialized biomedical embedding model and a temporary, question-specific FAISS
index, the system eficiently identifies and ranks the most semantically relevant text snippets from the
retrieved documents.</p>
          <p>This focused approach ensures that the subsequent generation module is provided with highly
pertinent evidence, thereby maximizing the potential for accurate and comprehensive answers across
the diverse question types.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Generation Module</title>
        <p>The Generation Module utilizes a Large Language Model (LLM), Llama3.1-Aloe-Beta-8B, to synthesize
answers based on the retrieved snippets. Llama3.1-Aloe-Beta-8B is an open healthcare LLM that
achieves state-of-the-art performance on several medical tasks. The Aloe Beta family, which includes
7B, 8B, 70B, and 72B model sizes, is trained using a consistent recipe on top of both Llama3.1 and
Qwen2.5 model families. Aloe models are trained on 20 medical tasks, resulting in robust and versatile
healthcare capabilities. Evaluations demonstrate that Aloe models rank among the best in their class.</p>
        <p>Its goal is to produce responses that are not only accurate but also adhere to the specific format
requirements of the input question type (summary, yes/no, factoid, list) and the evaluation criteria
defined by the challenge. This involves carefully crafted prompting strategies, including few-shot
learning and a two-stage generation process for certain question types.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Few-Shot Prompting</title>
          <p>The core of the generation process relies on structured prompting designed for instruction-tuned LLMs.
A role-based format (System, User) is employed to guide the model’s behavior efectively.</p>
          <p>
            The System Prompt sets the context and constraints for the LLM. It defines the desired persona (e.g.,
acting as a “biomedical expert”), establishes strict operational boundaries (e.g., instructing the model to
base its answer solely on the provided context snippets), specifies the mandatory output format, and
encourages a step-by-step reasoning process (Chain-of-Thought, CoT) [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ].
          </p>
          <p>The User Prompt provides the specific inputs for the current question, dynamically inserted into a
predefined template. These inputs typically include the original question (question), the concatenated
text of the top-ranked retrieved snippets ({summaries}), and few-shot examples ({examples}), extract
from the challenge feedback on from Round 4, to guide the model’s response style and format. For
generating “exact” answers (detailed below), the user prompt also incorporates a summary of the
reasoning ({answer} summary) derived from a preceding “ideal” answer generation step.</p>
          <p>Few-Shot Learning is incorporated by including examples of high-quality question-answer pairs
from the feedback of previous challenge rounds within the prompt ({examples}). These examples
are curated using a dedicated function. This function processes feedback files, filters for questions
marked as answerReady, extracts and normalizes the ideal and exact answers (e.g., handling list formats
appropriately), and saves these curated examples.</p>
          <p>For ideal answers, the function extracts all questions of the specified type that have a non-empty ideal
answer from the feedback file, formats each as a question followed by its ideal answer, and presents
them in a standardized format (e.g., Question: “...” -&gt; "answer”: “..."). For exact answers,
the function similarly filters for questions of the target type with non-empty exact answers, and for each,
it constructs a prompt line including the question, a summary of the ideal answer (if available), and the
exact answer in a structured output format (e.g., Question: “...” -&gt; Summary of relevant
information: “...” - "answer”: “..."). Including these examples helps the LLM adapt to
the specific nuances, expected level of detail, and precise formatting requirements of the target task.</p>
          <p>A key aspect of the generation strategy is a Two-Stage Generation Process, employed for all question
types except “summary". This approach separates the synthesis of information from the extraction of
the final concise answer:</p>
          <p>Stage 1: Ideal Answer Generation</p>
          <p>The first stage aims to generate the ideal answer. The prompt includes the question, the retrieved
snippets (summaries), and few-shot examples. The objective is to produce a comprehensive,
paragraphsized summary, written in an expert tone, that directly answers the question using only the information
present in the provided snippets.</p>
          <p>This ideal answer includes a chain of thought explaining the reasoning process. For questions explicitly
designated as “summary” type, the output of this stage constitutes the final response. This initial step
leverages the LLM’s strength in synthesis and summarization, forcing it to reconcile potentially disparate
or subtly conflicting information from various snippets into a coherent narrative and logical reasoning
path.
},
{
}</p>
          <p>Listing 1: Ideal prompt for ’yesno’ questions
‘‘role’’: ‘‘system’’,
‘‘content’’: ‘‘You are an expert in the biomedical field. Your task is to answer yes
/no questions based solely on the knowledge provided in the fragments extracted
from the title and abstract sections of biomedical articles. You must ONLY use
the information in these fragments to infer your answer. Sometimes the context
does not provide a direct answer; in such cases, you must understand and join
all the information provided and infer a correct conclusion.Follow a brief chain
-of-thought process to arrive at your conclusion and include this reasoning in
your response. Provide your final output in JSON format with the keys ‘‘
chain_of_thought’’ and ‘‘answer’’, where answer is a single paragraph that
completely answers the question and summarizes the most relevant information
from the context in the style of a biomedical expert.If no answer can be
inferred from the context, set ‘‘answer’’ to exactly ‘‘No answer provided". Pay
attention to these examples of questions + answer types:{examples}In conclusion,
this must be the answer exact format you must generate:{{"chain_of_thought’’:
Detailed reasoning process, ‘‘answer’’: Paragraph detailing a complete
informative answer or ‘‘No answer provided"}}Do not use line breaks in your text
nor use double quotes like ’’ directly in your answer text, only use single
quotes like ’ or double quotes with an escape character symbol to not compromise
the JSON format and structure.The answer should resemble one given by an expert
."
‘‘role’’: ‘‘user’’,
‘‘content’’: ‘‘These are the fragments of information extracted from the articles
from where you must infer the answer to the given question: {summaries}The
question you must answer in the JSON format is: {question}"
Stage 2: Exact Answer Generation
This stage is executed only if the question type is not “summary” (i.e., for “yesno”, “factoid”, “list”).
• Input Preparation. The JSON output containing the ideal answer from Stage 1 is parsed, and a
concise summary of its chain-of-thought is extracted.
• Prompt Construction. A new prompt is constructed specifically for generating the exact answer.</p>
          <p>This prompt includes the original question, the same set of retrieved snippets (summaries) used
in Stage 1, relevant few-shot examples, and the summary of the reasoning derived from the ideal
answer’s chain of thought.
• Generation Goal. The LLM is tasked with generating a concise, factual answer in the specific
format required by the question type (e.g., ’yes’ or ’no’ for “yesno” questions; one or more entity
names for “factoid” or “list” questions). By providing the reasoning path from the ideal answer, this
stage guides the LLM to produce an exact answer consistent with the synthesized understanding
developed in Stage 1. This makes the task more constrained, efectively becoming an extraction
or classification based on the LLM’s own prior reasoning, reducing the likelihood of pulling
incorrect facts directly from potentially noisy snippets.</p>
          <p>The LLM inference is performed using a generate function. Key parameters are carefully tuned to
promote factual and consistent outputs: a low temperature (0.11) minimizes randomness, a “repetition
penalty” (1.15) discourages verbatim repetition, and “max new tokens” (1024) provides suficient length
for both the chain-of-thought and the answer components.</p>
          <p>Listing 2: Exact prompt for ’yesno’ questions
},
{</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Response Post-processing</title>
          <p>After the LLM generates the response, a final post-processing step is applied to ensure the output strictly
conforms to the expected format, which is crucial for automated evaluation pipelines and downstream
consumption. Minor variations in LLM output, such as missing characters or inconsistent casing, could
otherwise lead to evaluation failures.</p>
          <p>These post-processing steps act as essential guardrails, ensuring the system’s output is robustly
formatted and machine-readable, maximizing compatibility with automated evaluation scripts.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation and results</title>
      <p>In this section we discuss the performance of our three system variants across retrieval, snippet
extraction, exact answer generation, and ideal answer quality, drawing on both automatic and manual
evaluation metrics.</p>
      <p>The system names are listed below, we simplified the identifiers to improve the readability of the
analysis:
• Q&amp;A based on RAG: original identifier for [sinai-uja-v1]
• Q&amp;A based on RAG2: original identifier for [sinai-uja-v2]
• sinai_uja_RAG: original identifier for [sinai-uja-v3]</p>
      <p>The diferences between the 3 system lie on slight alterations in the language of the prompt; the
last system (sinai-uja-v3) being the most optimized for the task and the one shown before. Also, they
difer on the pubmed files used; for system sinai-uja-v1 the pubmed files were not updated to the last
permitted date. For the other 2 system the retrieval subset was the exact same and it is reflected in the
retrieval related metrics below.</p>
      <sec id="sec-4-1">
        <title>4.1. Results for the information retrieval module</title>
        <p>Since sinai-uja-v2 and sinai-uja-v3 share the same updated corpus, their retrieval metrics are identical.
These results confirm that simply refreshing the underlying PubMed data can drive meaningful
improvements in RAG pipelines for emerging biomedical topics. Overall these results are quite poor all
we will discuss the possible reasons for this in the error analysis section.</p>
        <p>The document retrieval performance was evaluated using standard metrics such as Mean Precision,
Recall, F-Measure, Mean Average Precision (MAP), and Geometric Mean Average Precision (GMAP).
Precision measures the proportion of retrieved documents that are relevant, while Recall measures
the proportion of relevant documents that are retrieved. F-Measure is the harmonic mean of Precision
and Recall. MAP considers the ranking of retrieved documents, rewarding systems that place relevant
documents higher in the list. GMAP is similar to MAP but uses a geometric mean, emphasizing
performance on questions where systems struggled.</p>
        <p>As shown in Table 1, updating the PubMed snapshot for v2 (and v3) yields substantial gains over v1,
which relied on an outdated index.</p>
        <p>Precision increases from 0.0353 to 0.0417 (+18%), indicating that a higher fraction of the top-10
retrieved articles are relevant. Recall jumps from 0.0824 to 0.1209 (+47%), reflecting a slightly broader
coverage of the gold-standard articles. MAP nearly doubles (0.044 → 0.0824) and GMAP rises from zero
to 0.1, showing that retrieval freshness is critical for ranking quality.</p>
        <p>On the other hand, snippet extraction performance, detailed in Table 2, utilizes similar metrics
adapted to account for potential overlap between returned and golden snippets. The oficial measure
for snippets in Phase A (which is analogous to the relevant material submission in Synergy) is Mean
F-Measure. Interestingly, system sinai-uja-v1 slightly outperformed sinai-uja-v2 and sinai-uja-v3
across all snippet extraction metrics.</p>
        <p>This suggests that while the updated document set improved document retrieval, the older document
set used by sinai-uja-v1 might have contained snippets that were, on average, slightly more aligned or
had better overlap with the golden snippets for the questions in this round, despite the overall lower
document recall. The identical performance of sinai-uja-v2 and sinai-uja-v3 in snippet extraction is
expected, given they used the same retrieval subset.</p>
        <p>We must highlight how our RAG-based pipelines achieved the highest snippet extraction scores
in Round 4, outperforming the next best competitor by nearly 0.20 points on the recall metric and
0.10 points across the rest. Our submissions led an otherwise tightly clustered field of nine teams,
demonstrating that our dynamic indexing and overlap-aware matching deliver markedly superior
snippet recall and precision—even in rapidly evolving biomedical domains.</p>
        <p>Mean Precision</p>
        <p>Recall F-Measure</p>
        <p>The results for the document and snippet retrieval modules present an intriguing dichotomy. While
the document retrieval metrics (Mean Precision, Recall, F-Measure, MAP, GMAP) indicate a relatively
modest performance across all three system variants, with the updated PubMed corpus demonstrably
improving these metrics for v2 and v3, the snippet extraction results paint a significantly diferent
picture.</p>
        <p>Notably, all our RAG-based systems achieved the highest scores in snippet extraction in Round 4,
outperforming competitors by a substantial margin, particularly in Recall and overall F-Measure. This
suggests that our methodology for identifying and ranking relevant text excerpts within the retrieved
documents is highly efective at pinpointing the most pertinent information, even if the initial document
retrieval phase could be further optimized.</p>
        <p>One possible explanation for this discrepancy is that while the initial set of retrieved documents
might not have perfectly aligned with all the gold-standard documents, our semantic search and ranking
strategy within those retrieved documents excels at identifying the key information-bearing snippets.
The slight outperformance of v1 in snippet extraction, despite its outdated document corpus, could
indicate that the specific content within its retrieved documents had a higher degree of overlap with
the gold-standard snippets for the questions in this round.</p>
        <p>The overall dominance of our systems in snippet retrieval underscores the strength of our dynamic
indexing and semantic matching approach in extracting crucial evidence for downstream answer
generation, even in the challenging context of evolving biomedical knowledge.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results for the answer generation module</title>
        <p>The answer generation module was evaluated based on the quality of both exact and ideal answers.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Evaluation of exact answers</title>
          <p>For Yes/No questions, accuracy measures the proportion of correctly answered questions. F1-Yes and
F1-No measure the F1 score for questions with "yes" and "no" as the golden answer, respectively. Macro
F1 is the unweighted average of F1-Yes and F1-No, serving as the oficial measure.</p>
          <p>Table 3 shows that both sinai-uja-v1 and the most optimized system, sinai-uja-v3, achieved the
highest Accuracy (0.9) and identical Macro F1 scores (0.899). System sinai-uja-v2 performed slightly
worse with an Accuracy of 0.8 and a Macro F1 of 0.7917. This suggests that the prompt language
optimization in sinai-uja-v3 was beneficial for Yes/No questions, bringing its performance in line with
the initial sinai-uja-v1 system.</p>
          <p>Factoid questions require returning a list of up to 5 entity names. Strict Accuracy counts a question
as correct only if the top returned entity matches the golden answer, while Lenient Accuracy considers
a question correct if the golden answer is anywhere in the top 5 list. Mean Reciprocal Rank (MRR), the
oficial measure, rewards systems that rank the correct answer higher.</p>
          <p>Table 4 indicates that sinai-uja-v3 achieved the best performance for Factoid questions with a
Strict Accuracy, Lenient Accuracy, and MRR of 0.4. Systems sinai-uja-v1 and sinai-uja-v2 had
identical and lower scores of 0.3 across all three metrics. This 33% relative boost in ranking the correct
entity underscores the value of tailored few-shot examples and optimized prompt templates for factoid
extraction in small-training scenarios.</p>
          <p>For List questions, systems must return a list of entity names, and performance is evaluated using
Mean Precision, Recall, and F-Measure against a golden list. Mean F-Measure is the oficial measure.</p>
          <p>According to Table 5, sinai-uja-v3 achieved the highest Mean Precision (0.2331) and F-Measure
(0.2667), indicating better accuracy in the entities it returned. System sinai-uja-v1, however, achieved
the highest Recall (0.5104), suggesting it was better at identifying more of the golden list entities, even
if its precision was lower. System sinai-uja-v2 performed in between the other two for Precision
and F-Measure but had the lowest Recall. The prompt optimization in sinai-uja-v3 appears to have
improved the precision of the list of entities returned.</p>
          <p>The results for the list question answering reveal a trade-of between precision and recall across our
system variants. While sinai-uja-v3, with its optimized prompt, demonstrates the highest precision and
F-Measure, indicating a greater accuracy in the returned entities, sinai-uja-v1 exhibits the highest recall.
This suggests that despite potentially including more irrelevant entities (lower precision), sinai-uja-v1
was more successful in identifying a larger proportion of the entities present in the gold-standard lists.</p>
          <p>This higher recall in sinai-uja-v1, despite its less refined prompt and use of an older PubMed index,
could be attributed to a broader or more exhaustive generation strategy. Perhaps the less constrained
prompt in v1 encouraged the LLM to generate a wider range of potential entities, increasing the chances
of overlap with the gold standard, even if it also led to the inclusion of more false positives. Conversely,
the prompt optimization in sinai-uja-v3, while improving the accuracy of the generated list, might
have inadvertently led to a more focused generation, potentially missing some relevant entities present
in the gold standard.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Evaluation of ideal answers</title>
          <p>Ideal answers, which are paragraph-sized summaries, were evaluated using both automatic and manual
metrics.</p>
          <p>Automatic evaluation of ideal answers (Table 6) was performed using ROUGE metrics, specifically
ROUGE-2 (R-2) and ROUGE-SU4 (R-SU4), which measure the overlap of bigrams and skip-bigrams
(with a maximum skip distance of 4), respectively, between the generated answer and a set of reference
texts (including golden answers and relevant snippets).</p>
          <p>R-2 and R-SU4 are reported as Recall (Rec) and F1 scores. System sinai-uja-v1 surprisingly achieved
the highest scores across all automatic ROUGE metrics (R-2 Rec 0.2093, R-2 F1 0.2047, R-SU4 Rec 0.2169,
R-SU4 F1 0.2126). Systems sinai-uja-v2 and sinai-uja-v3 performed slightly lower, with sinai-uja-v3
showing a minor edge in R-SU4 F1 over sinai-uja-v2.</p>
          <p>This suggests that the ideal answers generated by sinai-uja-v1 had a higher n-gram and skip-bigram
overlap with the reference texts according to these automatic measures.</p>
          <p>Manual evaluation by biomedical experts (Table 7) provides a human perspective on the quality
of ideal answers, assessing Readability, Information Recall, Information Precision, and Information
Repetition on a 1-5 scale.</p>
          <p>Crucially, these expert judgments paint a diferent picture, in contrast to the automatic evaluation,
the most optimized system sinai-uja-v3, received the highest average scores from the experts across
all manual criteria (Readability 4.24, Recall 4.27, Precision 3.96, Repetition 4.27).</p>
          <p>System sinai-uja-v2 had slightly lower scores than sinai-uja-v3 but was generally rated higher
than sinai-uja-v1, except for Information Precision where sinai-uja-v1 scored slightly higher.</p>
          <p>This discrepancy between automatic and manual evaluation highlights that while ROUGE metrics
measure surface-level overlap, expert judgment captures the nuances of information accuracy,
completeness, coherence, and fluency that are critical for high-quality biomedical summaries. The prompt
language optimization in sinai-uja-v3 seems to have significantly improved the perceived quality of
the ideal answers by biomedical experts.</p>
          <p>In summary, the updated PubMed dataset used by sinai-uja-v2 and sinai-uja-v3 led to improved
document retrieval performance compared to sinai-uja-v1. For exact answers, the prompt language
optimization in sinai-uja-v3 resulted in better performance for Factoid and List questions and matched
the performance of sinai-uja-v1 for Yes/No questions, outperforming sinai-uja-v2. While automatic
metrics favored sinai-uja-v1 for ideal answers, manual evaluation by experts clearly indicated that the
ideal answers generated by the optimized sinai-uja-v3 system were of higher quality.</p>
          <p>Overall, sinai_uja_RAG (v3) achieves the best end-to-end performance on Synergy Task 13,
demonstrating that the combination of up-to-date retrieval, entity-driven query formation, and carefully tuned
few-shot prompts yields the most efective biomedical QA system in this interactive, feedback -driven
framework.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Error analysis</title>
      <p>While the quantitative results presented in Section 4 demonstrate the overall strengths and weaknesses
of our three system variants, a deeper look at the specific types of errors made by each component
is essential to guide future improvements. In this section, we systematically compare the outputs of
our best-performing pipeline (sinai_uja_RAG, v3) and its predecessors against the “golden” reference
answers and expert feedback from the most recent Synergy round (Round 4). In total, we will analyse
55 questions, the ones that either had a golden-marked ideal answer or exact answer.</p>
      <p>By examining discrepancies at each stage—document retrieval, snippet selection, ideal answer
generation, and exact answer extraction; we aim to uncover common failure modes (e.g., missing key
entities, over-reliance on noisy contexts, prompting ambiguities) and quantify their impact on
end-toend performance. We begin by aligning each system’s retrieved PMIDs and snippet overlaps with the
gold-standard set, then analyze patterns in incorrect “yes/no” and factoid outputs, and finally inspect
qualitative feedback on ideal answers to identify recurring issues in coherence and factual completeness.</p>
      <p>This error analysis not only elucidates where and why our models falter but also provides actionable
insights for refining our multi-stage biomedical QA pipeline.</p>
      <sec id="sec-5-1">
        <title>5.1. Answer error analysis by question type</title>
        <sec id="sec-5-1-1">
          <title>5.1.1. Yes/no questions</title>
          <p>Across all three versions, our system answered 10 “yes/no” questions in the exact-answer task. System
v3 achieved perfect accuracy on 8 of these, but failed on two cases—an overall 80% exact accuracy for
“yes/no” type. Both errors share a common pattern, they correspond to false positives on “no” questions:
• 63adca9ec6c7d4d31b00001d: “Is Cinpanemab efective for Early Parkinson?s Disease?"
• 6593debd06a2ea257c00001f : “Are there new European Union formal eforts to increase the number
of clinical trials aimed at improving the mental health of children?"</p>
          <p>In both instances, the provided snippets did not explicitly afirm the queried claim. Instead, they
described related phenomena (e.g., clinical trials for hormonal contraception; mental-health studies in
children) without supporting the specific assertion.</p>
          <p>If we take a look at the snippets retrieved and the chain-of-thought generated by the model, it is clear
that the model’s tendency to infer a positive conclusion from loosely related or partially overlapping
contexts appears to be the root cause.</p>
          <p>The first misstep occurred with the question, “Are there new European Union formal eforts to
increase the number of clinical trials aimed at improving the mental health of children?”. Although the
available snippets described a variety of important studies—a resilience-building RCT for young carers,
a school-based mental-health program (PROMEHS), and epilepsy-focused trials involving children with
comorbid mental-health disorders—nowhere did they reference an EU-wide mandate or policy explicitly
intended to scale up pediatric clinical trials.</p>
          <p>Listing 3: Snippets from correctly retrieved documents for question ’6593debd06a2ea257c00001f’
{
’33669796’ = ‘‘It is estimated that 4-8% of youth in Europe carry out substantial care
for a family member or significant other. To prevent adverse psychosocial
outcomes in young carers (YCs), primary prevention resilience building
interventions have been recommended. We describe the study protocol of an
international randomized controlled trial (RCT) of an innovative group
intervention designed to promote the mental health and well-being of adolescent
YCs (AYCs) aged 15-17. The RCT will be conducted in six European’’,
’36360389’: ‘‘The challenges of today’s society demand high levels of socio-emotional
skills in children and adolescents; therefore, mental health is an important issue
to be addressed and promoted in schools. The present study aims to investigate
the effectiveness of a school mental health program (Promoting Mental Health at
Schools; PROMEHS) designed to promote socio-emotional learning and prevent
psychosocial difficulties in children and adolescents. The study was conducted on
a sample of 1392 students"</p>
          <p>Instead, our v3 pipeline treated these investigator-driven research projects as if they were coordinated
EU initiatives, and confidently answered “yes.” In reality, these are valuable individual studies, but they
fall short of constituting a formal, Europe-level efort to increase trial volume.</p>
          <p>A similar error pattern arose with Cinpanemab’s assessment in early Parkinson’s disease. The
model’s chain of thought highlighted the antibody’s binding to aggregated alpha-synuclein and the
dose-dependent biomarker changes observed in the SPARK trial. These pharmacodynamic results are
encouraging indicators of target engagement, but they do not equate to clinical eficacy. There were no
reported improvements in patient motor scores or disease progression metrics in the snippets provided.
Nevertheless, the model conflated biomarker success with treatment efectiveness, again answering
“yes” when the correct response was “no”.</p>
          <p>Listing 4: Chain-of-thought generated for question ’63adca9ec6c7d4d31b00001d’
’chain-of-thought’: To determine if Cinpanemab is effective for early Parkinson’s
disease, I analyzed the provided article fragments focusing on the specific
details about Cinpanemab’s mechanism of action, its ability to bind to
alphasynuclein, and any clinical trial data mentioned. Cinpanemab is described as a
human-derived monoclonal antibody that targets aggregated alpha-synuclein, a key
component in the pathogenesis of Parkinson’s disease. The study highlights its
ability to form stable complexes with alpha-synuclein in cerebrospinal fluid,
indicating engagement with the target in the central nervous system. Furthermore
, the evaluation of biomarker results from the Phase 2 SPARK clinical trial
shows a significant correlation between Cinpanemab dosage and binding to
alphasynuclein, suggesting that the drug behaves as intended within the CNS. Although
the overall assessment of Cinpanemab’s efficacy in treating early Parkinson’s
disease requires further investigation beyond these findings, the available data
indicate promise for its potential as a disease-modifying therapy.
{
}
{</p>
          <p>Listing 5: Factoid exact answer for question ’6593de2f06a2ea257c00001e’
‘‘question’’: ‘‘What is the burden of human trypanosomiases in European hospitals?’’
,
‘‘golden’’: ‘‘Spain has the highest burden of Chagas disease in Europe.’’,
‘‘results’’: [</p>
          <p>{"v1’’: ‘‘Imported cases’’, ‘‘bleu’’: 0, ‘‘bertscore’’: (0.4, 0.2, 0.27)},
Both errors underscore an overly permissive inference strategy: in the absence of an explicit statement,
the model infers the most optimistic interpretation. To address this, our future prompts should require
the model to pinpoint a direct quote or clearly labelled policy statement before granting a “yes” .</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Factoid questions</title>
          <p>
            Our systems encountered varied success on the factoid questions, where the goal is to extract a concise,
factual answer. To quantitatively assess performance for specific questions, we employed BLEU [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] and
BertScore [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] metrics, comparing the generated answers against the gold standards. This qualitative
analysis of specific errors provides further insights into the systems’ limitations.
          </p>
          <p>One notable error occurred for the question, “What is the burden of human trypanosomiases in
European hospitals?” (id: 6593de2f06a2ea257c00001e). The gold answer specified “Spain has the highest
burden of Chagas disease in Europe” as shown in Listing 5 while our systems (v1, v2, and v3) returned
“Imported cases” or “African trypanosomiasis,” resulting in a BLEU score of 0. The BertScore also
indicated low similarity. This suggests a failure in correctly identifying the specific type and geographical
distribution of trypanosomiasis requested. The retrieved snippets likely discussed trypanosomiases in a
broader context, and the model failed to pinpoint the European-specific burden and the prominence of
Chagas disease in Spain.</p>
          <p>{"v2’’: ‘‘African trypanosomiasis’’ ‘‘bleu’’: 0, ‘‘bertscore’’: (0.37, 0.2,
0.26)},
{"v3’’: ‘‘African trypanosomiasis’’ ‘‘bleu’’: 0, ‘‘bertscore’’: (0.37, 0.2,
0.26)},</p>
          <p>Conversely, for the question “What is the percentage of women that have successfully undergone
fertility treatment in the European Union?” (id: 6593d3ab06a2ea257c00001a), all our systems correctly
identified “24%,” which closely matches the gold answer “24.” While the BLEU score was 0 due to
the presence of the percentage symbol, the high BertScore indicated strong semantic similarity. This
highlights a limitation of exact-match metrics like BLEU in cases where the generated answer is
semantically correct but difers slightly in formatting.</p>
          <p>Another interesting case involved the question “What biological process is associated with Vitamin
K?". The gold answers comprised a list of several processes, including “Blood coagulation,” “Bone
metabolism,” and “protection against oxidative stress.” Our systems successfully identified some of
these (e.g., “anti-inflammatory activity,” “protection against oxidative stress,” “blood coagulation,” “bone
metabolism”).</p>
          <p>However, they also generated non-gold answers like “brain development” (v1, v2) and “activation of
PXR signaling pathway” (v3), leading to a lower overall performance score. This indicates that while the
systems could retrieve some relevant information, they also included plausible but ultimately incorrect
biological processes, highlighting a potential issue with over-generation or the inclusion of less directly
supported information.</p>
          <p>Listing 6: Factoid exact answer for question ’6772c765592fa48873000009’
‘‘question’’: ‘‘What biological process is associated with Vitamin K?’’,
‘‘golden’’: [’Blood coagulation’, ’Bone metabolism’, ’Bone Health Maintenance’, ’
calcium metabolism’, ’Inhibition of arterial calcification’, ’Nervous System
Function’, ’anti-inflammatory activity’, ’protection against oxidative stress’],</p>
          <p>On the other hand, we observed strong performances on questions like “What is the best
noninvasive method to diagnose endometriosis?” (id: 6777b471592fa4887300000c) where all systems correctly
answered “transvaginal ultrasonography,” achieving a high BLEU score. However, the systems failed to
identify the other gold answer, “MRI,” suggesting a limitation in retrieving the full spectrum of correct
answers.</p>
          <p>Similarly, for “What disease is associated with chalk-stick fracture?” (id: 63adcb54c6c7d4d31b00001f ),
all systems correctly answered “ankylosing spondylitis” resulting in a perfect BLEU score, despite a
seemingly low BertScore which might be less reliable for short, exact answers. Lastly, for “What disease
can be treated with vorasidenib?” (id: 677e8319592fa4887300001e), systems v2 and v3 perfectly matched
the gold answer “IDH-mutant gliomas,” while v1 returned “IDH-mutant glioma,” resulting in a slightly
lower BLEU score. This minor variation likely stems from the prompt diferences and the slightly
diferent document set used by v1.</p>
          <p>Overall, the factoid question analysis reveals that while our RAG-based system can often extract
correct and precise answers, it still struggles with nuanced questions requiring specific geographical or
contextual information, and can sometimes over-generate plausible but incorrect answers. Future work
will focus on refining the prompting strategies and retrieval mechanisms to improve the accuracy and
completeness of factoid answer extraction.</p>
        </sec>
        <sec id="sec-5-1-3">
          <title>5.1.3. List questions</title>
          <p>The evaluation of our systems on list questions, where a set of entities is expected as the answer, reveals
several instances of poor performance as indicated by BLEU scores of 0. Examining these cases provides
insights into the challenges our RAG-based approach faces in generating comprehensive and accurate
lists.</p>
          <p>For the question “Which are the most common psychiatric events associated with the consumption of
cannabis?” (id: 63ac44c2c6c7d4d31b000011), all three system variants returned a consistent list: [’panic
attack psychotic states dependence abuse cognitive disorders amotival syndrome anxiety disorder
suicidal idea and attempt Hallucinogenic efects Stimulant efects Aggressive behavior Addictive behaviors
Bipolar Disorder’].</p>
          <p>As shown in Listing 7, when compared against individual gold standard entities like ’psychosis’,
’paranoia’, and ’mood disorders’, the BLEU score is 0 across all systems. This suggests that while the
generated list contains several relevant terms, it might be too broad or include terms not considered
the most common by the gold standard. The low BertScore further supports a lack of close semantic
overlap with the specific gold entities.</p>
          <p>Listing 7: List exact answer for question ’63ac44c2c6c7d4d31b000011’
‘‘question’’: ‘‘Which are the most common psychiatric events associated with the
consumption of cannabis?’’,
‘‘golden’’: ["psychosis’’, ‘‘paranoia’’, ‘‘mood disorders"],
‘‘results’’: [
{"v1’’: ["panic attack psychotic states dependence abuse cognitive disorders
amotival syndrome anxiety disorder suicidal idea and attempt Hallucinogenic
effects Stimulant effects Aggressive behavior Addictive behaviors Bipolar
Disorder"], ‘‘bleu’’: 0, ‘‘bertscore’’: (0.3134, 0.1506, 0.2034)},
{"v2’’: ["panic attack psychotic states dependence abuse cognitive disorders
amotival syndrome anxiety disorder suicidal idea and attempt Hallucinogenic
effects Stimulant effects Aggressive behavior Addictive behaviors Bipolar
Disorder"], ‘‘bleu’’: 0, ‘‘bertscore’’: (0.3134, 0.1506, 0.2034)},
{"v3’’: ["panic attack psychotic states dependence abuse cognitive disorders
amotival syndrome anxiety disorder suicidal idea and attempt Hallucinogenic
effects Stimulant effects Aggressive behavior Addictive behaviors Bipolar
Disorder"], ‘‘bleu’’: 0, ‘‘bertscore’’: (0.3134, 0.1506, 0.2034)}</p>
          <p>A similar pattern of a lengthy generated list failing to precisely match specific gold entities is
observed for the question “Which tools are used to predict mortality in paediatric sepsis?” (id:
658447ca06a2ea257c000001). All systems returned an extensive list of potential predictors. However,
when evaluated against individual gold entities like ’platelet count’, ’SIRS’, ’NEWS’, ’NGAL’, and
’PT/INR’, the BLEU score is consistently 0. This suggests that while the systems correctly identified
various relevant factors, they might have included additional less critical or less established predictors,
leading to a mismatch with the specific entities prioritized in the gold standard.</p>
          <p>Continuing with the ongoing trend, the question “Most common physical signs of self-harm in
pre-teenagers” (id: 6777b5e1592fa4887300000d) resulted in the systems listing ’cutting’, ’scratching’,
and ’burning’. Compared to broader terms like ’skin tearing’, ’wound-healing hindrance’, and ’striking
objects’, the BLEU score is 0, indicating that the experts where aiming for more general answers rather
than the specific ones generated.</p>
          <p>These instances of low BLEU scores on list questions highlight a recurring challenge: our systems
often generate lists of specific entities that, while potentially relevant, do not perfectly align with the
level of abstraction or the exact set of entities prioritized in the gold standard.</p>
          <p>This could be due to limitations in the retrieval process, the prompt’s influence on the specificity
of the generated list, or the inherent dificulty in precisely matching the granularity and scope of
expert-curated lists. Future eforts should focus on refining the system’s ability to generate lists that
balance specificity with broader categories and ensure a closer match to the expected entities.</p>
        </sec>
        <sec id="sec-5-1-4">
          <title>5.1.4. Summary questions and all ideal answers</title>
          <p>For a more holistic error analysis, we also examined the ideal answers generated by our best-performing
system (v3) across all question types (factoid, yes/no, list, and summary). These paragraph-length
answers aim to synthesize the key information from the retrieved snippets.</p>
          <p>We evaluated these generated ideal answers using the same metrics as before, Bleu and BertScore.
Ideal answers for yes/no, factoid and list questions</p>
          <p>Analyzing the ideal answers generated for non-summary question types reveals both strengths and
weaknesses in the system’s summarization capabilities.</p>
          <p>The factoid question seen before with id ’6593d3ab06a2ea257c00001a’ asked for the percentage of
women with successful fertility treatment in the EU. Our system’s ideal answer, “The percentage of
women that have successfully undergone fertility treatment in the European Union is 14%,” achieved
a low BLEU score of 0.346. This suggests a factual inaccuracy compared to the expected summary,
which likely highlighted the reported 24% success rate. The low BertScore results (0.2490, 0.2071, 0.2142)
further indicates a lack of semantic similarity. This error likely stems from the system extracting or
synthesizing an incorrect numerical value from the retrieved documents.</p>
          <p>Listing 8: Ideal answer for question ’6772c615592fa48873000007’
‘‘question’’: ‘‘Is morphomics associated with the study of body image or body
dysmorphia?’’,
‘‘golden’’: [
’Morphomics is not associated with the study of body image or body dysmorphia.</p>
          <p>Instead, it is focused on analyzing body composition, specifically through
medical imaging techniques like CT scans, to assess and predict health
outcomes related to body fat and muscle content.’,
’No, morphomics is not associated with the study of body image or body
dysmorphia. It involves analyzing 3D medical imaging data to assess
anatomical features, often for predicting health outcomes or assessing risk
in medical situations.’,
’Morphomics involves the study and analysis of body composition through medical
imaging techniques, primarily focusing on biological markers in muscle and
adipose tissues. It does not concern itself with psychological aspects such
as body image or body dysmorphia.’
‘‘v3’’:
‘‘While morphomics is occasionally referenced in discussions about body
image or body dysmorphic disorders, its main focus lies in the
objective measurement and analysis of body composition through
medical imaging, making it less directly associated with the study of</p>
          <p>body image or body dysmorphia compared to other fields. Its primary
goal is to quantify and analyze anatomical features, not to address
subjective perceptions of body image.’’,
‘‘bleu’’: 0.5502,
‘‘bertscore’’: (0.2712, 0.1739, 0.2076)</p>
          <p>"Is morphomics associated with the study of body image or body dysmorphia?” (id:
6772c615592fa48873000007) was another yes/no question posed, the correct answer with respect with
the feedback is “No". While the generated ideal answer correctly concludes that morphomics is not
primarily associated with body image, the initial clause as seen in Listing 8 introduces unnecessary
nuance and potential confusion that contradicts the direct “no” implied by the gold snippets. This
suggests a tendency for the model to provide overly verbose or hedging answers even when the source
information is direct.</p>
          <p>On the other hand, question “Are there clinical trials on hormonal male birth control methods?” (id:
6593d46c06a2ea257c00001b) was precisely the one marked by the BioASQ team as one of the golden
answers in the feedback file used for this analysis. As shown in Listing 9, the system accurately extracted
and synthesized the key details regarding the existence and focus of clinical trials on hormonal male
contraception, demonstrating a strong ability to summarize factual information for yes/no questions.</p>
          <p>Listing 9: Ideal answer for question ’6593d46c06a2ea257c00001b’
Answers for summary questions</p>
          <p>In the realm of summary questions, we observed both successes and shortcomings in our system’s
ability to generate concise and accurate summaries based on the retrieved information. For the question
concerning Gene Set Enrichment Analysis (GSEA) (id: 677ecc12592fa48873000028), the generated
summary included details about statistical significance and database integration, which were not the primary
focus of the expert-provided concise explanation of GSEA’s core function. This suggests a tendency
to include potentially extraneous information. Similarly, for the question about Over-Representation
Analysis (ORA) (id: 677ed8b8592fa4887300002b), the system’s summary, while capturing the main idea,
also incorporated a discussion of ORA’s limitations, which deviated from the purely descriptive nature
of the first expert snippet and only partially aligned with the second.</p>
          <p>On the other hand, for the summary question regarding gender-afirming care in minors “Should
gender-afirming surgery be performed in people under 18 years of age?” (id: 6593d2e006a2ea257c000019),
the generated ideal answer efectively captured the cautious yet potentially beneficial stance, reflecting
the nuances and emphasis on individualization found in the expert feedback. Similarly, for the
question about the adverse efects of rare earth elements (REEs) (id: 6772e791592fa4887300000b),
the system provided a comprehensive summary listing various health risks associated with REEs,
aligning well with the diverse information presented in the expert snippets. These examples of high
BLEU scores indicate the system’s capability to synthesize coherent and relevant summaries when
the source information is well-aligned and the task is to consolidate multiple related facts or perspectives.</p>
          <p>Error overview</p>
          <p>The overall performance of our systems, as depicted in Figures 3a to 7 that showcase the avegare
performance of our system across each question and answers types; reveals varying degrees of success
across diferent question and answer types. For exact answers, system v1 and v2 achieved a perfect BLEU
score of 0.9 for Yes/No questions, slightly outperforming v3 (0.8), while their BertScore F1 remained
consistently high (around 0.59) across all variants (Figure 3a). In Factoid exact answers (Figure 3b), v1
and v2 showed marginally higher BLEU scores (0.46-0.468) compared to v3 (0.416), with BertScore F1
scores remaining very close across all systems (around 0.42-0.43). A notable improvement was observed
in List exact answers (Figure 3c), where v3 achieved a significantly higher BLEU score (0.31) compared
to v1 (0.135) and v2 (0.205), indicating that the prompt optimizations in v3 were particularly efective for
generating more accurate lists, even if the BertScore F1 remained relatively consistent (around 0.38-0.39)
across the variants.</p>
          <p>1.0
0.8
e
r
o
cS0.6
e
g
a
r
e
vA0.4
0.2</p>
          <p>Turning to ideal answers, the BLEU scores are generally higher than for exact answers, reflecting
the greater flexibility in phrasing for longer generated texts. For Yes/No ideal answers (Figure 4), v2
showed the strongest performance with a BLEU score of 0.88, surpassing both v1 (0.758) and v3 (0.782),
while BertScore F1 scores were consistently around 0.23. In Factoid ideal answers (Figure 5), v1 led with
a BLEU score of 0.777, followed by v3 (0.701) and v2 (0.651). For List ideal answers (Figure 6), v3 and v2
performed similarly and slightly better than v1 in terms of BLEU (0.801 and 0.800 respectively vs. 0.767).
Finally, for Summary ideal answers (Figure 7), v1 achieved the highest BLEU score (0.716), with v3 close
behind (0.702) and v2 performing slightly lower (0.64). Across all ideal answer types, the BertScore F1
remained relatively low (around 0.17-0.23), suggesting that while the generated ideal answers share
n-gram overlap with the gold standards, there might be room for improvement in semantic precision or
conciseness as perceived by BertScore.</p>
          <p>Overall, these metrics highlight that no single system variant consistently outperformed the others
across all metrics and question types, indicating that the subtle changes in prompts and data snapshots
had specific, rather than universal, impacts on performance.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we presented a comprehensive Retrieval-Augmented Generation (RAG) system designed
to address the dynamic and challenging biomedical semantic Question Answering tasks within the
BioASQ Task Synergy 13. Our multi-stage pipeline integrated a preliminary question analysis with an
entity extraction methodology for query enhancement, a robust semantic search for document and
snippet retrieval, and a fine-tuned Llama-based Large Language Model for answer generation across
various formats. This work underscored the critical role of interactive expert feedback in refining
systems for evolving biomedical information.</p>
      <p>Our evaluation revealed several key insights into the performance of RAG architectures in this
demanding domain. While our initial document retrieval performance, though improved by updating
the PubMed corpus, remained modest overall, our systems demonstrated exceptional capabilities in
snippet extraction. Notably, our RAG-based pipelines achieved the highest snippet extraction scores in
Round 4, significantly outperforming other competitors. This highlights the strength of our dynamic
indexing and semantic matching strategy in identifying the most relevant textual evidence from a given
set of documents, even if the initial document pool was not perfectly optimized.</p>
      <p>The analysis of exact answers showed varied performance across question types. For Yes/No questions,
our systems achieved high accuracy, though with a tendency for overly permissive inference, leading
to false positives when explicit confirmation was absent. In Factoid questions, we observed a mixed
bag, with successes in extracting precise facts but also challenges in handling nuanced geographical or
contextual information, and a propensity for over-generation of plausible but incorrect answers. For
List questions, our prompt optimizations in system v3 led to improved precision and F-Measure, though
system v1 surprisingly maintained the highest recall, suggesting a trade-of between specificity and
coverage.</p>
      <p>Regarding ideal answers, our systems generally produced coherent and informative summaries,
especially for Yes/No and some List questions, achieving high BLEU scores. However, the error analysis
identified areas for improvement, including issues with numerical accuracy, the introduction of
extraneous details, and occasional over-synthesis of information not strictly central to the expert’s ideal
summary. The consistent, albeit low, BertScore F1 across ideal answers suggests room for enhancing
semantic precision and conciseness.</p>
      <p>In conclusion, our participation in BioASQ Task Synergy 13 provided invaluable insights into the
practical challenges and opportunities of applying RAG to real-world, dynamic biomedical QA. The
results confirm the efectiveness of our snippet retrieval approach and the potential of LLMs in
synthesizing complex information. Future work will focus on refining our query enhancement strategies to
improve initial document retrieval, developing more robust prompt engineering techniques to balance
precision and recall in exact answer generation, and enhancing the LLM’s ability to produce more
concise and precisely targeted ideal answers by mitigating over-generation and improving numerical
accuracy. These eforts will further strengthen our RAG pipeline, moving towards more robust and
reliable biomedical question answering systems.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was partly supported by the grants FedDAP (PID2020-116118GA-I00), MODERATES
(TED2021-130145B-I00), SocialTOX (PDC2022-133146-C21) and CONSENSO (PID2021-122263OB- C21)
funded by MCIN/AEI/10.13039/501100011033, “ERDF A way of making Europe” and “European Union
NextGenerationEU/PRTR”. This work was also funded by the Ministerio para la Transformación Digital
y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU –
NextGenerationEU within the framework of the project Desarrollo Modelos ALIA.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT and, occasionally, Perplexity to assist
with LaTeX syntax as well as grammar and spelling checks in English. After using these tools, the
author(s) carefully reviewed and edited the content as needed and take(s) full responsibility for the
publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. N. Maria</given-names>
            <surname>Di</surname>
          </string-name>
          <string-name>
            <surname>Nunzio</surname>
          </string-name>
          , Giorgio,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <source>Overview of BioASQ</source>
          <year>2025</year>
          :
          <article-title>The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in: L.
          <string-name>
            <surname>P. A. G. S. d. H. J. M. F. P. P. R. D. S. G. F. N. F. Jorge Carrillo-de Albornoz</surname>
          </string-name>
          , Julio Gonzalo (Ed.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Paliouras, Overview of BioASQ Tasks 13b and Synergy13 in CLEF2025</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bougiatiotis</surname>
          </string-name>
          , G. Paliouras,
          <string-name>
            <surname>BioASQ-QA</surname>
          </string-name>
          :
          <article-title>A manually curated corpus for Biomedical Question Answering</article-title>
          ,
          <source>Scientific Data</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>170</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.-W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Generative large language models augmented hybrid retrieval system for biomedical question answering</article-title>
          ,
          <source>CLEF Working Notes</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hartshorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sravankumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korenev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hinsvark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G.</surname>
          </string-name>
          et al.,
          <source>The llama 3 herd of models</source>
          ,
          <year>2024</year>
          . arXiv:
          <volume>2407</volume>
          .
          <fpage>21783</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Gururajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lopez-Cuena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bayarri-Planas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tormos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hinjos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bernabeu-Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arias-Duart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Martin-Torres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Urcelay-Ganzabal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gonzalez-Mallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Alvarez-Napagao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ayguadé-Parra</surname>
          </string-name>
          , U.
          <string-name>
            <surname>C. D.</surname>
          </string-name>
          Garcia-Gasulla,
          <article-title>Aloe: A family of fine-tuned open healthcare llms</article-title>
          ,
          <year>2024</year>
          . arXiv:
          <fpage>2405</fpage>
          .
          <year>01886</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Cano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Romanou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Matoba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Salvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pagliardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Köpf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohtashami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sallinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhaeirad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Swamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Krawczuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bayazit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marmet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montariol</surname>
          </string-name>
          , M.
          <article-title>-</article-title>
          <string-name>
            <surname>A. Hartley</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Jaggi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bosselut</surname>
          </string-name>
          , Meditron-70b:
          <article-title>Scaling medical pretraining for large language models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2311.16079. arXiv:
          <volume>2311</volume>
          .
          <fpage>16079</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W. tau Yih, T. Rocktäschel,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <year>2021</year>
          . arXiv:
          <year>2005</year>
          .11401.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Barron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Grantcharov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Eren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bhattarai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Solovyev</surname>
          </string-name>
          , G. Tompkins,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nicholas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Matuszek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. S.</given-names>
            <surname>Alexandrov</surname>
          </string-name>
          ,
          <article-title>Domain-specific retrieval-augmented generation using vector stores, knowledge graphs, and</article-title>
          tensor factorization,
          <year>2024</year>
          . URL: https: //arxiv.org/abs/2410.02721. arXiv:
          <volume>2410</volume>
          .
          <fpage>02721</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kilicoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Biomedrag:
          <article-title>A retrieval augmented large language model for biomedicine</article-title>
          ,
          <source>Journal of Biomedical Informatics</source>
          <volume>162</volume>
          (
          <year>2025</year>
          )
          <article-title>104769</article-title>
          . URL: https://www. sciencedirect.com/science/article/pii/S1532046424001874. doi:https://doi.org/10.1016/j. jbi.
          <year>2024</year>
          .
          <volume>104769</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Raza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Reji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bashir</surname>
          </string-name>
          ,
          <article-title>Large-scale application of named entity recognition to biomedicine and epidemiology</article-title>
          ,
          <source>PLOS Digital Health</source>
          <volume>1</volume>
          (
          <year>2022</year>
          )
          <article-title>e0000152</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Deka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jurek-Loughrey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Deepak</surname>
          </string-name>
          ,
          <article-title>Improved methods to aid unsupervised evidence-based fact checking for online health news</article-title>
          ,
          <source>Journal of Data Intelligence</source>
          <volume>3</volume>
          (
          <year>2022</year>
          )
          <fpage>474</fpage>
          -
          <lpage>504</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2201.11903. arXiv:
          <volume>2201</volume>
          .
          <fpage>11903</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          ,
          <year>2002</year>
          . doi:
          <volume>10</volume>
          .3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          , Bertscore:
          <article-title>Evaluating text generation with bert</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/
          <year>1904</year>
          .09675. arXiv:
          <year>1904</year>
          .09675.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>