<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BIT.UA at BioASQ 13B: Revisiting Evaluation, DPRF-Enhanced Retrieval and Fine-Tuned LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Richard A. A. Jonker</string-name>
          <email>richard.jonker@ua.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tiago Almeida</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>João R. Almeida</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sérgio Matos</string-name>
          <email>aleixomatos@ua.pt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IEETA/DETI, LASI, University of Aveiro</institution>
          ,
          <addr-line>Aveiro</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>N_5_3 O_5_3 Summ.</institution>
          <addr-line>all @ 3 (6) Summ. all @ 5 (6) Summ. of 2, 3, 2</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Biomedical information retrieval and question answering are critical for navigating the vast and continually expanding body of biomedical literature. The BioASQ Task B Challenge provides a valuable benchmark for developing and evaluating systems capable of retrieving relevant documents and generating high-quality answers to biomedical questions. This paper describes our participation in the thirteenth edition of the BioASQ challenge, focusing on Task B, which builds on our participation in the twelfth edition of the challenge. For Phase A, we employed a hybrid two-stage retrieval pipeline combining BM25-based retrieval with transformer-based rerankers such as BioLinkBERT and PubMedBERT. We showcased the performance of Dense Pseudo Relevance Feedback (DPRF) using the BGE-M3 model to enhance retrieval. In Phase B, we used a range of large language models (LLMs) for answer generation, including OpenBioLLM, LLaMA Nemotron, and a custom fine-tuned Gemma-3 27B model. We also unified our summarization and ensembling strategy from last year into a single generation step to improve efficiency and coherence. A key insight from this year's participation was the persistent misalignment between automatic evaluation metrics and human-judged answer quality, a discrepancy that influenced both our system design and our interpretation of results in previous years. For Phase A, our systems consistently achieved top rankings. We discuss these outcomes in light of evaluation challenges and outline promising directions for future work. All code is publicly available at https://github.com/bioinformatics-ua/BioASQ13B.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Dense Retrieval</kwd>
        <kwd>Semantic Search</kwd>
        <kwd>Large Language Model</kwd>
        <kwd>Answer Generation</kwd>
        <kwd>Pseudo Relevance Feedback</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In last year's participation, reliance on automatic evaluation metrics led us to draw misleading
conclusions and optimize for certain metrics, which did not translate into better human-evaluated
performance. For instance, runs achieving top ranks by automatic metrics were often rated poorly by
human evaluators, and vice versa. This discrepancy exposed fundamental limitations in relying on
automatic metrics alone to guide system development and evaluation.</p>
      <p>These misleading conclusions motivated a fundamental rethink of our methodology for the 13th
BioASQ challenge. We prioritized reproducibility and robustness, emphasizing consistent evaluation
across large batches and reducing reliance on potentially unreliable automatic metrics, aided by the fact
that automatic metrics were not released throughout the competition this year. Our approach shifted
away from reliance on individual model performance and towards using a large variety of models in
ensembles. This year, we also kept our submissions relatively consistent across batches to gauge
performance more reliably.</p>
      <p>A central theme of our work this year is evaluative uncertainty: recognizing and addressing the
instability and unreliability of current automatic metrics, and exploring strategies that maintain
performance across batches and evaluation regimes. We see this as a necessary evolution for BioASQ and
similar challenges where generative quality is increasingly difficult to quantify automatically.</p>
      <p>For Phase A, we refined retrieval strategies by adjusting the number of reranked documents to
balance performance and computational cost, as well as focusing more on DPRF. To address generation,
historically our weakest component, we significantly revamped our pipeline. We transitioned to faster
inference engines and expanded our model lineup to include LLaMA Nemotron and OpenBioLLM,
along with a two-stage fine-tuning of the larger Gemma-3 27B model. We also unified ensembling and
summarization into a single generation step, improving answer coherence and efficiency.</p>
      <p>The rest of the paper is organized as follows. Section 2 describes our submissions from the previous
year, highlighting discrepancies between the conclusions we initially drew and those informed by the
final human evaluations. Section 3 outlines the changes made to our system this year. Section 4 presents
both our internal validation results and batch-wise competition outcomes, discussing the performance
of individual systems. Section 5 offers a critical analysis of our findings, including challenges related to
evaluation metrics. Finally, Section 6 summarizes our conclusions and outlines directions for future
work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Previous Work</title>
      <p>
        In our participation in BioASQ 12 Task B [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], our approach comprised a two-stage retrieval pipeline,
neural reranking, and retrieval-augmented generation (RAG) techniques. For the document retrieval
phase (Phase A), we implemented a hybrid strategy that began with first-stage retrieval using BM25
from the PISA framework [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] to efficiently filter candidate documents from the vast PubMed corpus.
This was followed by neural reranking using transformer-based models, specifically PubMedBERT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and BioLinkBERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Typically, we applied neural reranking to the top 1,000 documents retrieved via
BM25. To further enhance retrieval performance, we incorporated Dense Pseudo Relevance Feedback
(DPRF) through semantic search using BGE-M3 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] embeddings. The goal was to identify documents
semantically similar to the top 50–100 documents ranked by the neural reranker. The outputs from various models
were then combined using reciprocal rank fusion (RRF), which proved beneficial for producing robust
final rankings.
      </p>
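      <p>To make the two-stage pipeline concrete, the sketch below illustrates the overall idea: BM25 candidates
are reranked by a cross-encoder, and the per-model rankings are fused with RRF. This is a minimal
illustration under assumptions, not our exact implementation; the reranker name and the bm25_search
helper are placeholders.</p>
      <preformat>
# Sketch of the two-stage retrieval pipeline (illustrative, not our exact code).
# Assumes a bm25_search(query, k) helper backed by a PISA BM25 index and
# reranker checkpoints fine-tuned for relevance scoring (names are placeholders).
from sentence_transformers import CrossEncoder

def rerank(query, docs, model_name="placeholder-biolinkbert-reranker"):
    # Score (question, abstract) pairs with a cross-encoder and sort by score.
    ce = CrossEncoder(model_name)
    scores = ce.predict([(query, d["abstract"]) for d in docs])
    return [d for _, d in sorted(zip(scores, docs), key=lambda p: -p[0])]

def rrf(rankings, k=60):
    # Reciprocal rank fusion: each list votes 1 / (k + rank) for its documents.
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Usage, assuming bm25_search and a list of reranker checkpoints:
# candidates = bm25_search(question, k=1000)
# runs = [[d["id"] for d in rerank(question, candidates, m)] for m in models]
# final_ranking = rrf(runs)
      </preformat>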
      <p>Our conclusions from this phase include:</p>
      <sec id="sec-2-1">
        <title>1. Large models and small models performed similarly.</title>
        <p>2. According to internal validation, higher data quality led to better performance than data quantity.</p>
        <p>This contradicted initial findings from preliminary results but was ultimately supported by the
oficial evaluation. This discrepancy highlights a case of misaligned automatic evaluation.
3. DPRF yielded performance improvements, although these were minimal.</p>
      <p>For answer generation (Phases A+ and B), we adopted a RAG framework. We provided the top five
retrieved documents as context to several LLMs, including Llama 3 70B, Nous-Hermes2-Mixtral, and a
fine-tuned Gemma 2B model. The Gemma 2B model was fine-tuned on BioASQ training data using
Low-Rank Adaptation (LoRA). To ensure the relevance and accuracy of generated answers, we developed
answer selection mechanisms that used our neural reranker to evaluate candidate answers as
pseudo-documents (see the sketch at the end of this section). We also experimented with snippet-based
context and implemented answer truncation strategies to comply with the BioASQ 200-word limit.
Finally, we used Mixtral to summarize model outputs, producing more concise answers that aligned
better with BioASQ evaluation criteria. Our conclusions have evolved significantly since the release of
the human evaluation results. A comparison of rankings from last year (Phases A+ and B) is shown in
Table 1 and Table 2. The conclusions for Phases A+ and B are summarized as follows:</p>
      <p>1. According to automatic metrics, the fine-tuned Gemma model achieved the second-best F1
score (system-3, Phase A+ batch 2). However, in human evaluation, it was ranked only
9th, outperformed by one of our other systems, which ranked 17th in F1 and 1st in recall.
Notably, our run which ranked 3rd in human evaluation (Phase A+, system-3, batch 4) had an
F1 rank of 20 and a recall rank of 23, illustrating a significant misalignment between ROUGE-based
automatic metrics and human judgments. Further, we tried optimizing for high recall last year,
which did not correlate with top performance in human evaluations.</p>
      <p>2. Document sources did not play a significant role in performance (Phase A+). In the context of
a competition, it appears more effective to keep the document source constant and vary other
model or pipeline components.</p>
      <p>3. Summarization techniques (Phase A+, systems 3 and 4, batches 3 and 4, and also in Phase B) led to
modest F1 gains at the expense of recall. However, these changes ultimately translated into
substantial improvements in human evaluation scores.</p>
      <p>4. In Phase B, the main difference was the use of snippets, which consistently yielded top recall
scores (all bolded recall results relied on snippets). However, this did not always translate into
high human evaluation scores.</p>
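      <p>As referenced above, the answer selection mechanism treated candidate answers as pseudo-documents:
each candidate was scored against the question by a neural reranker, and the best-scoring answer was
kept. The following is a minimal sketch under the same assumptions as the retrieval sketch above; the
reranker name is again a placeholder.</p>
      <preformat>
# Answer selection sketch: rank candidate answers as if they were documents.
from sentence_transformers import CrossEncoder

def select_answer(question, candidate_answers,
                  model_name="placeholder-biolinkbert-reranker"):
    ce = CrossEncoder(model_name)
    # Each candidate answer plays the role of a pseudo-document.
    scores = ce.predict([(question, ans) for ans in candidate_answers])
    best_idx = max(range(len(candidate_answers)), key=lambda i: scores[i])
    return candidate_answers[best_idx]
      </preformat>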
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Building on some of the conclusions from last year, along with new intuitions and assumptions, we
began refining our methodology. For Phase A, our main objectives were to (1) successfully reproduce
last year’s submission and (2) place greater emphasis on DPRF, which we had previously underutilized.
Due to these requirements, our first batch felt relatively weak upon submission, as we didn’t have
everything fully prepared in time and were only able to train five models. After submitting the first
batch, we focused on validating models and adjusting the data used to train the reranker models, as well
as setting up DPRF, which was not working in B1. Unlike last year, we approached this phase with a new
perspective: rather than focusing on top-performing models based on validation, we assumed that most
models would perform similarly and could all contribute meaningfully when used in an ensemble. As a
result, we prioritized training diverse models while maintaining relatively strong individual performance,
as demonstrated in the Validation Results section. From B2 to B4, the methodology remained largely
consistent, with one key change introduced in B3: we reduced the number of documents reranked by
the neural reranker from 1,000 to 100. A small validation test confirmed this had negligible impact on
performance. The rationale was that we lacked the time to run inference for all models and apply DPRF
across all 2025 models. Finally, in B4, we also experimented with incorporating our 2023 models into
the ensemble.</p>
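      <p>As a sketch of what setting up DPRF involves: the abstracts of the top-ranked documents serve as
pseudo-relevant examples, their BGE-M3 embeddings are averaged into a feedback vector, and its nearest
neighbours in a pre-computed corpus index are retrieved to refine the ranking. This is an illustrative
reading of the technique, assuming corpus_embs and corpus_ids were built offline; the exact fusion with
the reranker output is elided.</p>
      <preformat>
# Dense Pseudo Relevance Feedback (DPRF) sketch with BGE-M3 embeddings.
# Assumes corpus_embs (N x d, L2-normalised rows) and corpus_ids built offline.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-m3")

def dprf_expand(top_doc_texts, corpus_embs, corpus_ids, k=100):
    # Embed the pseudo-relevant documents (e.g. the reranker's top 50-100).
    embs = encoder.encode(top_doc_texts, normalize_embeddings=True)
    feedback = embs.mean(axis=0)
    feedback = feedback / np.linalg.norm(feedback)
    # Cosine similarity against the corpus; keep the k nearest neighbours.
    sims = corpus_embs @ feedback
    nearest = np.argsort(-sims)[:k]
    return [corpus_ids[i] for i in nearest]

# The expanded list is then fused (e.g. via RRF) with the original run.
      </preformat>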
      <p>
        Our main changes this year focused on generation, which had been our weakest component in
previous years—last year, we only achieved a single second-place ranking in one of the batches for
Phase A+. Phase A+/B is a rapidly evolving and challenging task to keep pace with, especially given our
reliance on LLMs as the core of our RAG pipeline. While many of our conceptual approaches remained
similar to last year, the implementation changed significantly. We switched our inference engine from
Ollama [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to LMDeploy [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], greatly increasing inference speed, and transitioned our model lineup to
include Llama-Nemotron-70B1 [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], and OpenBioLLM2 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Following this, we experimented with
new prompting strategies and also restructured how we handle summarization and ensembling.
      </p>
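      <p>A minimal sketch of the new inference setup follows, assuming the LMDeploy pipeline API and the
quantized Nemotron checkpoint listed in the footnotes; the prompt builder and generation settings are
illustrative only.</p>
      <preformat>
# Batched RAG generation with LMDeploy (illustrative settings).
from lmdeploy import pipeline, GenerationConfig

pipe = pipeline("ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4")

def build_prompt(question, abstracts):
    # Simplified stand-in for the prompts in Appendix A.
    context = "\n\n".join(abstracts)
    return f"Context: {context}\nQuestion: {question}"

prompts = [build_prompt(q, docs) for q, docs in batch]  # batch assumed prepared
answers = pipe(prompts, gen_config=GenerationConfig(max_new_tokens=512))
      </preformat>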
      <p>In last year’s approach, we would ensemble multiple LLM outputs and then use one of our neural
rerankers to select the best answer for a given question. Summarization, at the time, was treated
independently and served only to compress model outputs into shorter texts. This year, we merged
ensembling and summarization into a single step. Specifically, we designed a summarization prompt
that takes multiple outputs from different models and combines them into one unified answer. This
approach allows us to achieve both ensembling and summarization within a single generation pass.</p>
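      <p>A sketch of the unified step is shown below; the prompt text is abridged from Summary Prompt 1 in
Appendix A, and collecting the per-model candidate answers is assumed to have happened beforehand.</p>
      <preformat>
# One generation pass that both ensembles and summarizes candidate answers.
SUMMARY_PROMPT = (
    "Act as a biomedical expert. You will receive multiple answers to a given "
    "question. Analyze these responses, extract all relevant information, and "
    "synthesize a concise yet comprehensive final answer (50-150 words).\n"
    'Finally, output a JSON object: {{"answer": "&lt;your concise answer&gt;"}}\n'
    "Question: {question}\nAnswers: {answers}"
)

def unified_summary_prompt(question, candidate_answers):
    # Number the candidates so the summarizer can weigh and merge them.
    numbered = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(candidate_answers))
    return SUMMARY_PROMPT.format(question=question, answers=numbered)
      </preformat>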
      <p>
        The last major change was introduced in B4, where we revisited the use of fine-tuned models. This
year, we chose to fine-tune a significantly stronger and larger model—Gemma-3 27B 3 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The
fine-tuning process consisted of two stages. First, we fine-tuned the model on a custom dataset 4 using LoRA
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and Unsloth5, serving as an initial knowledge injection phase. In the second stage, we fine-tuned
the model specifically for RAG tasks using Prompt 5 and incorporating five document abstracts as
context. A more minor change involved further experimentation with the number of documents used
during generation. Regarding Phase A+, all of our runs relied on documents from our Phase A run 4
submission, assuming these were the strongest Phase A results. In Phase B, we continued to emphasize
the use of snippets, further exploring their role and effectiveness in the generation pipeline.
      </p>
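      <p>The two stages can be sketched with the Hugging Face peft library as follows. The LoRA
hyperparameters and target modules are illustrative rather than the values we actually used, the dataset
loading and training loop are elided, and loading Gemma-3 may require a model class other than
AutoModelForCausalLM depending on the transformers version.</p>
      <preformat>
# Two-stage LoRA fine-tuning sketch for Gemma-3 27B (illustrative values).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "google/gemma-3-27b-it"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # only the adapter weights are trained

# Stage 1: knowledge injection on the custom dataset (plain causal-LM loss).
# Stage 2: RAG-style tuning on (Prompt 5 + five abstracts, gold answer) pairs.
# Both stages run a standard supervised fine-tuning loop over the examples.
      </preformat>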
      <p>The prompts used for generation generally follow similar styles, with Prompt 1 being the simplest
and producing the weakest answers. Prompt 2 is slightly more complex, while Prompts 3 to 5 introduce
higher complexity, eliciting reasoning steps, with Prompt 5 providing an example. For the summaries,
we offered two prompts. These prompt changes occurred at the same stage as Phase A, with the primary
difference being the inclusion of an example. The prompts can be seen in Appendix A.</p>
      <p>Footnotes:
1. https://huggingface.co/ibnzterrell/Nvidia-Llama-3.1-Nemotron-70B-Instruct-HF-AWQ-INT4
2. https://huggingface.co/aaditya/Llama3-OpenBioLLM-70B
3. https://huggingface.co/google/gemma-3-27b-it
4. The dataset is currently under double-blind review.
5. https://docs.unsloth.ai/</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>
        In this section, we describe the different models and configurations we evaluated as part of our
participation in BioASQ 13 Task B. We outline the submissions made and report the official preliminary
results provided by the organizers. The focus is on both retrieval and generation components, with
performance assessed using standard automatic evaluation metrics. We also provide observations on
how different approaches and design choices influenced the overall outcomes. As a reminder, the results
here present the official BioASQ metrics (MAP, ROUGE-2), as described in our previous work [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <sec id="sec-4-1">
        <title>4.1. Validation Results</title>
        <p>Validation was conducted on the first and second batches of the 2024 dataset. For reference, our best
scores from these batches were 0.4142 for Batch 1 (B1) and 0.4412 for Batch 2 (B2), indicating that our
current validation results are competitive with the large ensembles we used last year. This section
presents validation results for Phase A (no validation was conducted for the generation task). The
validation was performed prior to the release of Batch 2, and a summary of the findings is shown in
Table 3. These results are based on an arbitrarily selected baseline model, with various hyperparameters
altered to observe their impact. Overall, we observed a 2-point spread between the best- and
worst-performing models, suggesting that most variations yielded relatively minor effects.</p>
        <p>In terms of model architecture, the base variants generally performed better than their large counterparts,
with BioMedBERT-base achieving the best performance and BioMedBERT-large the worst. Among
training durations, 4 epochs appeared optimal. The use of warm-up also provided slight performance
improvements in some configurations. Looking at the data configurations, we found that using
high-quality data led to better results than simply increasing data quantity. The ExPOS setting, which
expands positive samples using semantically similar but unannotated documents, did not improve
performance. While these documents may be relevant, their lack of annotation may have diluted their
usefulness. However, we note that the true value of ExPOS might only be measurable through manual
evaluation. Pointwise training continued to underperform relative to pairwise approaches. Increasing
the number of negative samples (KN) to 2 showed a modest improvement. Experimentation with
alternative sampling strategies, such as 'basicv2' and 'exponential', did not yield significant gains. In
summary, Phase A validation primarily served to eliminate poorly performing configurations from
consideration.</p>
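        <p>For reference, the pairwise objective referred to above can be sketched as a margin loss in which the
(question, positive document) score must exceed each of the KN sampled negatives; this is a generic
formulation assumed for illustration, not our exact training code.</p>
        <preformat>
# Pairwise reranker objective sketch: the positive must outscore each negative.
import torch

def pairwise_margin_loss(pos_scores, neg_scores, margin=1.0):
    # pos_scores: (B,) scores for (question, relevant document) pairs
    # neg_scores: (B, KN) scores for KN sampled negatives per question
    diff = margin - (pos_scores.unsqueeze(1) - neg_scores)
    return torch.clamp(diff, min=0.0).mean()
        </preformat>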
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Official Results - Phase A</title>
        <p>A summary of the systems submitted across each Phase A batch is presented in Table 4, with the
performance results shown in Table 5. Each system varies in the combination of models used, including
different training epochs, samplers, and ensemble strategies. Some techniques were developed between
batches, which explains the evolving configurations across submissions; however, we tried to keep most
things stable between batches, especially since we did not receive any intermediate results between batches.</p>
        <p>In Batch 1, we submitted five systems, primarily composed of a small set of newly trained 2025 models
(e.g., PubMedBERT and BioLinkBERT variants) and previously used 2024 systems. Systems 0 and 1
focused on pairwise models, with System-2 containing all the 2025 models and System-3 containing all
the 2024 models. System-4—an ensemble of both 2025 and 2024 systems—achieved the best MAP (42.46),
obtaining a slightly higher MAP than System-2 (42.14). System-1 had a marginally better precision
than the others (10.71). Notably, all our systems ranked among the top 5, outperforming both the best
external competitor (MAP: 38.06) and the median submission (MAP: 27.16).</p>
        <p>In the second batch, we fixed a number of issues present in the first batch and also introduced DPRF.
System-1, which applied DPRF to most of our 2025 models, achieved the best MAP (43.07) and precision
(11.65) among our submissions, placing 3rd overall. Interestingly, System-0, which contained all of
our newly trained 2025 models, obtained the same MAP as our 2025 models trained solely on ExPOS
data. This suggests that the ExPOS data may, in fact, be beneficial. Systems 3 and 4 were our large
ensembles combining 2025 and 2024 models. In this batch, we observed that including DPRF-enhanced
models in the large ensemble actually resulted in worse performance compared to using only the base
2025 models, even though the DPRF models alone had a better performance than the base runs. This
indicates that the ensemble strategy may need further refinement.</p>
        <p>In Batch 3, we retained most of the pipeline but reduced the number of reranked documents from 1000
to 100 to enable DPRF on all 2025 models. We also updated System-2 to use all our 2024 submissions.
Once again, DPRF was the best-performing approach, with the 2024 models surprisingly coming in
second. This result is somewhat concerning, as we would expect the newer 2025 models to outperform
the 2024 models, even with a smaller ensemble size. For the large ensembles, System-4 achieved a slightly
higher MAP than System-3, which is a step in the right direction.</p>
        <p>In the final batch, we experimented with including the 2023 runs in a large ensemble, as we were
curious to see how well older models would perform. In general, we observed that the 2025 models
with DPRF again achieved the best rank. However, the 2025 models, when included in larger ensembles,
remained consistently behind the DPRF-only systems. Among the ensemble configurations, none
outperformed Systems 0 and 1. Nevertheless, we note that the ensemble including the 2023 models did
perform slightly better than others.</p>
        <p>In general, the main conclusions we can draw from these automatic results (pending updates to the
gold standard) are as follows: DPRF appears conclusively better this year and represents a promising
step forward. Everything else remains relatively inconclusive. The weakness of the ensembles may
stem from changes in how we performed ensembling: instead of conducting a single RRF, we grouped
models into sub-ensembles (e.g., creating an ensemble of two sub-ensembles). The intuition behind this
was to better balance the weights of the model sets—ensuring that the fewer, but newer, 2025 models
could carry equal weight to the more numerous 2024 models. Overall, we consistently achieved top-tier
performance, with only Batch 2 not securing a 1st place preliminary rank.</p>
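        <p>For clarity, the grouped fusion described above can be sketched as nested RRF: runs are fused within
each year's sub-ensemble first, and the sub-ensemble rankings are then fused with each other, so the
smaller 2025 set carries the same weight as the larger 2024 set. The run lists are assumed given.</p>
        <preformat>
# Nested RRF sketch: fuse within each sub-ensemble, then across sub-ensembles.
def rrf(rankings, k=60):
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# runs_2025, runs_2024: lists of ranked PubMed id lists, one per model run.
# sub_2025 = rrf(runs_2025)            # few, newer models
# sub_2024 = rrf(runs_2024)            # many, older models
# final = rrf([sub_2025, sub_2024])    # each sub-ensemble carries equal weight
        </preformat>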
        <p>Across all batches, our systems consistently outperformed the median and often rivaled or exceeded
the best competitors. In Batch 1, we obtained the best MAP and precision. In Batches 3 and 4, our
systems ranked 1st overall. These results validate the evolution of our system design—starting from small
ensembles in early batches to robust DPRF-integrated systems in later ones. Despite some inconsistencies
between validation and submission results, likely due to training configuration discrepancies, our
systems demonstrated strong performance across diverse configurations and evolving evaluation
standards.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Official Results - Phase A plus</title>
        <p>A summary of the systems submitted across each Phase A plus batch is presented in Table 6, with the
performance results shown in Table 7. Each system varies in terms of prompting strategies and models.
Some techniques were developed between batches, which explains the evolving configurations across
submissions; however, we tried to keep most things stable between batches, especially since we did not
receive any intermediate results between batches.</p>
        <p>In Batch 1, we began by establishing baseline systems. Systems 0 and 1 represented our strongest
individual model/prompt configurations based on manual inspection: specifically, Nemotron and
OpenBioLLM using Prompt 3 and 5 abstracts, respectively. We then tested our summarization strategy—believed
to outperform individual runs based on qualitative assessment—on both configurations with 3 abstracts
(System-3) and 5 abstracts (System-4). According to automatic evaluation metrics, System-0 achieved
the highest F1 score (12.98), with System-1 slightly behind (12.43). However, it is worth noting that
System-1 obtained nearly double the recall of System-0. In contrast, our summarization-based systems
performed worse than the individual runs, which was surprising given our initial expectations.</p>
        <p>In Batch 2, we expanded our experiments with different prompt configurations. System-0 again
consisted of a single model (OpenBioLLM), this time with 10 abstracts and using Prompt 4. This
configuration performed reasonably well, achieving a rank of 8. Systems 1 through 3 explored various
prompts using Nemotron, each paired with summarization. System-4 was an ensemble summary
combining the outputs from the previous three. Despite our belief in the quality of these summaries,
their performance did not surpass that of the single-model system.</p>
        <p>Batch 3 introduced yet another prompt variation, and we reintroduced Nemotron for direct
comparison. System-0 featured the Nemotron run, while System-1 included the OpenBioLLM configuration,
which again achieved the best score among our submissions in this batch—though with a lower overall
rank than in earlier batches. An additional experiment in this batch assessed the effect of answer
ordering within summaries. F1 scores showed only marginal differences between different orderings,
suggesting the models were relatively robust to such variations.</p>
        <p>Batch 4 marked a major shift with the introduction of our fine-tuned Gemma 3 model. We submitted
two systems using this model: System-1 used the checkpoint after the first epoch, while System-3
used the checkpoint after the third. System-1 achieved the highest F1 score in the entire competition,
while System-3 placed tenth. Based on these automatic metrics, it appears that fine-tuned models
represent a significant advancement. However, we caution against overinterpreting these results without
human evaluation. Last year, similar trends in automatic scores were ultimately contradicted by human
assessment, which favored an LLM-generated response over our best-performing fine-tuned model.</p>
        <p>Overall, this year’s automatic results often contradicted our qualitative evaluations and intuitions.
All submitted runs were qualitatively selected for answer quality and alignment with BioASQ standards.
From this review, Nemotron appeared to produce significantly better responses than OpenBioLLM—yet
this was not reflected in the automatic metrics. Similarly, our summarization-based systems were
believed to be more aligned with BioASQ’s ideal answers, but they consistently underperformed in F1
evaluations, even in ensemble settings. These findings highlight the limitations of automatic metrics
in capturing the nuanced quality of biomedical question answering and reinforce the importance of
incorporating human judgments into final system evaluations.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Official Results - Phase B</title>
        <p>A summary of the systems submitted across each Phase B batch is presented in Table 8, with the
performance results shown in Table 9. Each system varies in terms of prompting strategies and models.
Some techniques were developed between batches, which explains the evolving configurations across
submissions; however, we tried to keep most things stable between batches, especially since we did not
receive any intermediate results between batches.</p>
        <p>In Batch 1, we began by establishing baseline systems. System-0 represented our strongest individual
model/prompt configuration based on manual inspection: Nemotron using Prompt 3 and snippets.
We then evaluated our summarization strategy across various settings: all Nemotron models using
abstracts (9 answers, System-1), all outputs from Nemotron (12 answers, System-2), all outputs from
OpenBioLLM (12 answers, System-3), and a final summary combining Systems 2 and 3. According to
automatic evaluation metrics, System-0 achieved the highest F1 score (18.64), with all other systems
performing worse.</p>
        <p>In Batches 2 and 3, we kept the submissions largely consistent. System-0 again featured Nemotron
with snippets, now using an updated prompt (Prompt 5). The remaining systems included summaries of
all snippet outputs (System-1 and System-6), Nemotron with Prompt 4 (8 answers, System-2), Nemotron
with Prompt 5 (8 answers, System-3), and a summary combining Systems 2 and 3. Once again, System-0
outperformed the others, although the performance gap narrowed. It remains unclear whether the
updated prompt reduced System-0’s relative effectiveness or improved the summaries’ quality. In Batch
4, we introduced our custom fine-tuned model (Systems 2 and 3), which performed relatively well
according to automatic metrics, achieving 6th and 7th place overall.</p>
        <p>Overall, we opt not to provide a detailed analysis of individual systems, as we expect many
conclusions to shift once human evaluation results are available. Additionally, the rank differences among
our summarization systems are relatively small, suggesting limited significance in their comparative
performance at this stage.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Future Work</title>
      <p>
        This year, we would like to reflect more broadly on the competition itself, its evaluation process, and
critically assess our own submissions. One key issue we observed—also highlighted by our experience
in past years—is the importance of having access to the correct, finalized evaluation results. The
discrepancy between intermediate results and the true gold standard has, in the past, led us to favor
systems that later proved suboptimal. While we do not intend to critique the competition’s structure,
we refrain from drawing strong conclusions, since the final gold standard will affect the results. This
issue regarding misalignment is a known issue in existing literature where various authors discuss
misalignment between automatic metrics and human annotation [
        <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
        ].
      </p>
      <p>This year, the lack of intermediate results between batches inadvertently had a positive effect: it
encouraged us to focus more on consistency and robustness across batches rather than optimizing
for short-term performance. Additionally, we would like to emphasize the importance of evaluating
cross-batch consistency. Some systems perform well in specific batches but not in others, indicating that
the data distribution can shift between batches. This variability underlines the importance of building
systems that generalize well and highlights a valuable aspect of the BioASQ challenge.</p>
      <p>In Phase A, we achieved strong and consistent performance, comparable to last year, which we find
encouraging. Notably, our DPRF-based submissions achieved top results in all three batches where they
were used, outperforming larger ensemble systems that we initially assumed would perform better. This
is a promising direction for future work—pending confirmation from human evaluation. We also plan to
further explore optimal ensemble strategies, as well as alternative embedding methods beyond BGE-M3
for DPRF. We would also like to further investigate different types of ensembling beyond RRF, including
reward modeling and performance-aware selection. Additionally, we aim to experiment with prompt
adaptation based on question similarity, potentially yielding more tailored and relevant answers.</p>
      <p>Regarding Phase B (generation), we refrain from in-depth analysis at this stage due to the absence
of human evaluation, which plays a crucial role in interpreting results. We prefer to avoid drawing
premature conclusions that may not align with qualitative assessments once they become available.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In Phase A, our DPRF-based systems consistently performed better than our larger ensembles, indicating
that carefully designed retrieval and reranking strategies can rival, and even exceed, the performance
of larger ensembles, providing a good direction for future work. In Phase B, while our LLM Prompts
showed promising output quality, automatic evaluation results did not always align with our qualitative
assessments—highlighting the ongoing challenge of evaluation in generative tasks. On the other
hand, our fine-tuned model obtained very good automatic metrics. However, we cannot conclude
whether this is due to its alignment with the style of BioASQ or because it genuinely outperforms
the other submissions. We remain cautious in interpreting the results before final gold standards and
human evaluations are available. Ultimately, BioASQ continues to be a valuable platform for rigorous
benchmarking and reflective system development, and we look forward to further contributing in future
editions.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the Fundação para a Ciência e a Tecnologia (FCT) under research
unit UIDB/00127 – IEETA. Richard A. A. Jonker is funded by the FCT doctoral grant PRT/BD/154792/2023,
with DOI identifier https://doi.org/10.54499/PRT/BD/154792/2023.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT as a writing assistant.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Prompts</title>
      <sec id="sec-9-1">
        <title>Prompt 1 Context: {context} Question: {question}</title>
      </sec>
      <sec id="sec-9-2">
        <title>Prompt 2</title>
      </sec>
      <sec id="sec-9-3">
        <title>Question: {question} {context} Prompt 3 Answer in less than 150 words, present a final json containing your answer {{"answer": answer}}</title>
        <p>Act as a biomedical expert. You will receive several {d_type} summarizing research findings and
methodologies. Along with this, a question will be provided (’[question]’).</p>
        <p>Your role is to analyze the {d_type} and provide a scientifically accurate, concise answer to the
question, leveraging the information from the {d_type}.</p>
        <p>Answer in less than 150 words, present a final json containing your answer {{"answer": answer}}.
Act as a biomedical expert. You will receive several {d_type} summarizing research findings and
methodologies. Along with this, a question will be provided. Your role is to analyze the {d_type}
and provide a scientifically accurate, concise answer to the question, leveraging the information
from the {d_type}.</p>
        <p>First read and understand the relevant information present in the several {d_type}, extracting all
relevant facts.</p>
        <p>After thinking about the information presented, present a final json containing your answer
{"answer": answer}. Please show all your reasoning first.</p>
      </sec>
      <sec id="sec-9-4">
        <title>Answer in around 50–150 words, without using any markdown.</title>
      </sec>
      <sec id="sec-9-5">
        <title>Question: {question}</title>
        <p>Act as a biomedical expert. You will receive several {d_type} summarizing research findings and
methodologies. Along with this, a question will be provided. Your role is to analyze the {d_type}
and provide a scientifically accurate, concise answer to the question, leveraging the information
from the {d_type}.</p>
        <p>First read and understand the relevant information present in the several {d_type}, extracting all
relevant facts, explaining all your reasoning.</p>
        <p>After thinking about the information presented, present a final json containing your answer
{"answer": answer}.</p>
        <p>Answer in around 50–150 words, use a concise format only with plain text (no lists or markdown).</p>
      </sec>
      <sec id="sec-9-6">
        <title>Question: {question} {context}</title>
      </sec>
      <sec id="sec-9-7">
        <title>Prompt 5</title>
        <p>Act as a biomedical expert. You will receive several {d_type} summarizing research findings and
methodologies. Along with this, a question will be provided. Your role is to analyze the {d_type}
and provide a scientifically accurate, concise answer to the question, leveraging the information
from the {d_type}.</p>
        <p>First read and understand the relevant information present in the several {d_type}, extracting all
relevant facts, explaining all your reasoning.</p>
        <p>After thinking about the information presented, present a final json containing your answer
{"answer": answer}.</p>
        <p>Answer in around 50–150 words, use a concise format only with plain text (no lists or markdown).</p>
      </sec>
      <sec id="sec-9-8">
        <title>For example:</title>
      </sec>
      <sec id="sec-9-9">
        <title>Question: "What is the use of P85-Ab?"</title>
      </sec>
      <sec id="sec-9-10">
        <title>Insert your thinking here.</title>
        <p>{"answer": "P85-Ab is a promising novel biomarker for nasopharyngeal carcinoma screening."}</p>
      </sec>
      <sec id="sec-9-11">
        <title>Question: {question}</title>
        <p>{context}
Summary Prompt 1
Act as a biomedical expert. You will receive multiple answers to a given question. Your task
is to analyze these responses, extract all relevant information, and synthesize a concise yet
comprehensive final answer (50–150 words, at least one complete sentence).
1. Carefully read and understand the key facts and insights from the provided answers.
2. Thoughtfully evaluate the information to form a well-reasoned conclusion.
3. Present your reasoning step-by-step before delivering the final response.</p>
      </sec>
      <sec id="sec-9-12">
        <title>Finally, output a JSON object in the following format: {"answer": "&lt;your concise answer&gt;"}</title>
      </sec>
      <sec id="sec-9-13">
        <title>Guidelines: - Ensure the answer is informative, clear, and medically accurate. - Do not use Markdown formatting. - Keep the response within the word limit.</title>
      </sec>
      <sec id="sec-9-14">
        <title>Question: {question}</title>
      </sec>
      <sec id="sec-9-15">
        <title>Answers: {answers}</title>
      </sec>
      <sec id="sec-9-16">
        <title>Summary Prompt 2</title>
        <p>Act as a biomedical expert. You will receive multiple answers to a given question. Your task
is to analyze these responses, extract all relevant information, and synthesize a concise yet
comprehensive final answer (50–150 words, at least one complete sentence).
1. Carefully read and understand the key facts and insights from the provided answers.
2. Thoughtfully evaluate the information to form a well-reasoned conclusion.
3. Present your reasoning step-by-step before delivering the final response.</p>
      </sec>
      <sec id="sec-9-17">
        <title>Finally, output a JSON object in the following format: {{"answer": "&lt;your concise answer&gt;"}}</title>
        <p>Guidelines:
- Ensure the answer is informative, clear, and medically accurate.
- Do not use Markdown formatting.
- Keep the response within the word limit.</p>
        <p>For example:</p>
      </sec>
      <sec id="sec-9-18">
        <title>Question: "What is the use of P85-Ab?"</title>
        <p>Insert your thinking here.
{{"answer": "P85-Ab is a promising novel biomarker for nasopharyngeal carcinoma screening."}}</p>
      </sec>
      <sec id="sec-9-19">
        <title>Question: {question}</title>
      </sec>
      <sec id="sec-9-20">
        <title>Answers: {answers}</title>
        <p>42
100
100
100
False
True
True
True
True
True
True
True
False
True
True
True
True
True
True
True
True</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <source>Overview of BioASQ</source>
          <year>2025</year>
          :
          <article-title>The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A. A.</given-names>
            <surname>Jonker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Reis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Almeida</surname>
          </string-name>
          , S. Matos, BIT.UA at BioASQ 12:
          <article-title>From Retrieval to Answer Generation</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . García Seco de Herrera (Eds.),
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2024</year>
          ), volume
          <volume>3740</volume>
          <source>of CEUR Workshop Proceedings</source>
          , Grenoble, France,
          <year>2024</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>67</lpage>
          . URL: https://ceur-ws.org/Vol-3740/paper-05.pdf,
          <source>notebook for the BioASQ Lab at CLEF</source>
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mallia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Siedlaczek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mackenzie</surname>
          </string-name>
          , T. Suel,
          <article-title>PISA: performant indexes and search for academia, in: Proceedings of the Open-Source IR Replicability Challenge co-located with 42nd</article-title>
          <source>International ACM SIGIR Conference on Research and Development in Information Retrieval, OSIRRC@SIGIR</source>
          <year>2019</year>
          , Paris, France, July
          <volume>25</volume>
          ,
          <year>2019</year>
          .,
          <year>2019</year>
          , pp.
          <fpage>50</fpage>
          -
          <lpage>56</lpage>
          . URL: http://ceur-ws.org/Vol-2409/docker08.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <article-title>A python interface to PISA!</article-title>
          , SIGIR '22,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , p.
          <fpage>3339</fpage>
          -
          <lpage>3344</lpage>
          . URL: https://doi.org/10.1145/3477495.3531656. doi:10.1145/3477495.3531656.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          , H. Cheng, M. Lucas,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <article-title>Domainspecific language model pretraining for biomedical natural language processing</article-title>
          ,
          <source>ACM Trans. Comput. Healthcare</source>
          <volume>3</volume>
          (
          <year>2021</year>
          ). URL: https://doi.org/10.1145/3458754. doi:10.1145/3458754.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yasunaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          , P. Liang,
          <article-title>LinkBERT: Pretraining language models with document links</article-title>
          , in: S. Muresan,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Villavicencio (Eds.),
          <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>8003</fpage>
          -
          <lpage>8016</lpage>
          . URL: https://aclanthology.org/2022.acl-long.551. doi:10.18653/v1/2022.acl-long.551.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Bge m3-embedding: Multi-lingual, multifunctionality, multi-granularity text embeddings through self-knowledge distillation</article-title>
          ,
          <year>2024</year>
          . arXiv:2402.03216.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Team</surname>
          </string-name>
          , Ollama:
          <article-title>Run large language models locally</article-title>
          , https://ollama.com,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>LMDeploy</given-names>
            <surname>Contributors</surname>
          </string-name>
          ,
          <article-title>Lmdeploy: A toolkit for compressing, deploying, and serving llm</article-title>
          , https://github.com/InternLM/lmdeploy,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aithal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Anh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brundyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Casper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Catanzaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cohen</surname>
          </string-name>
          , et al.,
          <source>Nemotron-4 340b technical report, arXiv preprint arXiv:2406.11704</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bukharin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Delalleau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Egert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kuchaiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          , Helpsteer2
          <article-title>- preference: Complementing ratings with preferences</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2410.01257. arXiv:2410.01257.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sankarasubbu</surname>
          </string-name>
          ,
          <article-title>Openbiollms: Advancing open-source large language models for healthcare and life sciences</article-title>
          , https://huggingface.co/aaditya/OpenBioLLM-Llama3-70B,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          , Gemma
          <volume>3</volume>
          (
          <year>2025</year>
          ). URL: https://goo.gle/Gemma3Report.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          , in:
          <source>ICLR</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Adlakha</surname>
          </string-name>
          , P. BehnamGhader,
          <string-name>
            <given-names>X. H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Meade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <article-title>Evaluating correctness and faithfulness of instruction-following models for question answering</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>12</volume>
          (
          <year>2024</year>
          )
          <fpage>681</fpage>
          -
          <lpage>699</lpage>
          . URL: https://doi.org/10.1162/tacl_a_00667. doi:10.1162/tacl_a_00667.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Stanovsky,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <article-title>Evaluating question answering evaluation</article-title>
          ,
          <source>in: Proceedings of the 2nd workshop on machine reading for question answering</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>119</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rahnamoun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shamsfard</surname>
          </string-name>
          ,
          <article-title>Multi-layered evaluation using a fusion of metrics and LLMs as judges in open-domain question answering</article-title>
          ,
          <source>in: Proceedings of the 31st International Conference on Computational Linguistics</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>6088</fpage>
          -
          <lpage>6104</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>