<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Prompting Matters: Snippet-Aware Strategies for Biomedical QA with LLMs in BioASQ 13b</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hajung Kim</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hoonick Lee</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yewon Cho</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jungwoo Park</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jueon Park</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soyon Park</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yan Ting Chok</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seungheun Baek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Donghyeon Lee</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaewoo Kang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AIGEN Sciences</institution>
          ,
          <addr-line>Seoul, 04778</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science and Engineering, Korea University</institution>
          ,
          <addr-line>Seoul, 02841</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Biomedical question answering (QA) plays a critical role in enabling efficient access to clinical and scientific knowledge, yet remains a challenging task due to domain complexity, terminology ambiguity, and the need to integrate evidence from multiple sources. We address these challenges in the context of the BioASQ 13b Task B challenge, which evaluates systems across yes/no, factoid, and list question types, and investigate the impact of prompt design on biomedical question answering. Specifically, we evaluate: (1) a standard format combining questions with all retrieved snippets, (2) randomized prompting with shuffled snippet orders and repeated trials, (3) a one-by-one snippet querying strategy with output aggregation, and (4) a no-snippet condition relying solely on the model's parametric knowledge. Final predictions are selected via majority voting or log-probability-based ranking, depending on the task type. In the final evaluation, team rankings were determined by averaging the best ranks achieved across sub-tasks (yes/no, factoid, and list), regardless of whether the top-performing results came from the same system. Based on this ranking scheme, our team achieved the highest average rank in Batches 1 and 4, and the second-highest in Batches 2 and 3, demonstrating the robustness and effectiveness of our prompt design.</p>
      </abstract>
      <kwd-group>
        <kwd>BioASQ 13b</kwd>
        <kwd>LLM</kwd>
        <kwd>question-answering</kwd>
        <kwd>prompt engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Accessing high-quality biomedical information remains a significant challenge for clinicians and
researchers who must navigate heterogeneous databases and fragmented knowledge sources to find
precise, evidence-based answers. Traditional keyword-based search engines are often inadequate for
handling complex expert queries, especially when answers are dispersed across multiple scientific
articles. To address these limitations, the BioASQ challenge [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] has been a cornerstone benchmark for
biomedical question answering (QA) systems. The challenge, now in its thirteenth edition (BioASQ 13b,
2025 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), is designed to promote the development of AI systems that can semantically index biomedical
literature and generate accurate, context-aware responses to expert-written questions. Among the
various tasks, BioASQ Task B—Biomedical Semantic QA—requires systems to answer questions using
evidence snippets retrieved from PubMed [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], while returning both exact answers (such as yes/no,
entities, or lists) and ideal summaries. Task B encompasses four question types: yes/no, factoid, list,
and ideal answers. In this work, we focus on the first three types, which require precise and structured
responses grounded in evidence. BioASQ 13b provides the largest training corpus to date, comprising
5,389 expert-authored questions with carefully curated snippet evidence, offering a unique testbed for
evaluating modern LLM-based QA systems.
      </p>
      <p>
        While recent advances in large language models (LLMs) such as GPT-4 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Claude [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have shown
impressive general-domain QA capabilities, their performance in biomedical contexts is highly sensitive
to how information is presented in the prompt. In particular, biomedical QA demands integrating
multiple retrieved snippets, each of which may contain partial or conflicting evidence. Prompt design
thus plays a pivotal role in determining whether the model can reason over the provided context
effectively.
      </p>
      <p>In this study, we systematically explore several snippet-aware prompting strategies to better
understand their impact on biomedical QA performance. We evaluate: (1) a default format that combines
task instruction, question, and the complete list of snippets in a single prompt; (2) a randomized
snippet order setting that introduces permutation over snippet order across repeated trials to reduce
positional bias; (3) a one-by-one approach that queries the model with each snippet separately and
aggregates the answers; and (4) a prior-knowledge-only condition in which the model generates an
answer without any evidence snippets, relying solely on parametric knowledge.</p>
      <p>We apply these prompting strategies to multiple LLMs, including GPT-4o-mini, GPT-4, and Claude, and
consolidate the outputs using log-probability-based ranking and majority voting to improve reliability.
This unified evaluation reveals the strengths and limitations of each prompting approach and model,
offering practical insights for designing robust biomedical QA systems. Our system achieved top-tier
performance in the BioASQ 13b Task B challenge, ranking first in batches 1 and 4 and second in
batches 2 and 3. These results highlight the critical importance of prompt structure, especially in
evidence-sensitive tasks such as factoid and list-type QA.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <p>
        The BioASQ challenge promotes the development of intelligent systems for biomedical question
answering. In this paper, we focus on Phase B of the BioASQ 13b task [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], where systems are required to
generate precise answers to questions using evidence snippets retrieved from PubMed abstracts [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Each question is paired with a set of snippets, and its expected answer format varies depending on the
question type: yes/no, factoid, or list.
      </p>
      <p>Yes/No. For this question type, the model is expected to answer "yes" or "no" based on the given
snippets. An example is: “Is age an underlying factor in the eye disease AMD?” Answering yes/no
questions often requires considering multiple snippets rather than relying on a single snippet alone.
The evaluation is based on accuracy or macro-averaged F1-score, where each label ("yes" or "no") is
treated as a separate class. This approach ensures that the model is not biased toward a majority class
and is evaluated fairly on both positive and negative responses. The macro-averaged F1-score is computed
as:</p>
      <p>
        \[ \text{Macro-F1} = \frac{1}{2}\left(\mathrm{F1}_{\text{yes}} + \mathrm{F1}_{\text{no}}\right) \]
        where, for each class \(c \in \{\text{yes}, \text{no}\}\), the precision, recall, and F1 score are given by:
        \[ \text{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \text{Recall}_c = \frac{TP_c}{TP_c + FN_c}, \qquad \mathrm{F1}_c = \frac{2\,\text{Precision}_c\,\text{Recall}_c}{\text{Precision}_c + \text{Recall}_c} \]
        Here, \(TP_c\), \(FP_c\), and \(FN_c\) denote the number of true positives, false positives, and false negatives for class \(c\), respectively.
      </p>
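      <p>To make the metric concrete, the following minimal Python sketch (our own illustration, not the official evaluation code) computes the macro-averaged F1 defined above from gold and predicted yes/no labels:</p>
      <preformat>
# Minimal sketch of macro-averaged F1 for yes/no answers (illustrative only).
def macro_f1(gold, pred):
    """gold, pred: equal-length lists of "yes"/"no" labels."""
    f1_scores = []
    for label in ("yes", "no"):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

print(macro_f1(["yes", "no", "yes"], ["yes", "yes", "yes"]))  # 0.4
      </preformat>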
      <p>
        Factoid. Factoid questions expect a single answer that best represents the requested factual
information. An example is: "What is the target of fezolinetant?" The ground-truth answers are
usually, though not always, contained within the snippets. This type of question differs from other QA
tasks, such as SQuAD [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], where answers can always be extracted from the given context. Answers to
factoid-type questions are represented as a double list of possible answers, where each inner list contains
acceptable variants of the same correct answer. In the evaluation, higher scores are assigned when the
preferred correct answer appears earlier in the system’s ranked list of predictions. Factoid questions are
evaluated using the Mean Reciprocal Rank (MRR), which accounts for both the correctness and the
ranking position of the answers. Specifically, the reciprocal of the rank at which the first correct answer
appears in the system’s prediction list is computed for each question. The MRR is then calculated as the
average of these reciprocal ranks over all factoid questions:
      </p>
      <p>
        \[ \mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i} \]
        where \(Q\) is the set of factoid questions, and \(\mathrm{rank}_i\) denotes the position of the first correct answer for the \(i\)-th question. If no correct answer is found in the predicted list, the reciprocal rank for that instance is defined as zero.
      </p>
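      <p>As a worked illustration of this definition, the short Python sketch below (our own illustration, not the official scorer) averages reciprocal ranks over a set of factoid predictions:</p>
      <preformat>
# Illustrative MRR computation: each prediction is a ranked list of candidate answers.
def mrr(gold_sets, ranked_predictions):
    """gold_sets[i]: set of acceptable answers; ranked_predictions[i]: ranked candidate list."""
    total = 0.0
    for gold, preds in zip(gold_sets, ranked_predictions):
        reciprocal = 0.0
        for rank, answer in enumerate(preds, start=1):
            if answer.lower() in {g.lower() for g in gold}:
                reciprocal = 1.0 / rank
                break
        total += reciprocal  # questions with no correct answer contribute 0
    return total / len(gold_sets)

# First question answered at rank 1, second at rank 2 -> MRR = (1 + 0.5) / 2 = 0.75
print(mrr([{"NK3 receptor"}, {"fever"}], [["NK3 receptor"], ["rash", "fever"]]))
      </preformat>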
      <p>List. List-type questions require the system to return a list of correct answers, i.e., more than
one acceptable answer. An example is: “Please list the symptoms of Chikungunya virus infection”.
Although list-type questions may appear similar to factoid-type questions, they differ in that they
expect multiple distinct answers rather than a single best answer or alias group. Each item in the gold
standard represents a unique element, and the system is evaluated based on how many of these items
are correctly retrieved. Evaluation is conducted using standard classification metrics: precision, recall,
and F1-score, calculated based on the overlap between the predicted and gold-standard answer sets.
Unlike factoid questions, the order of the answers does not affect the evaluation score; only the set
overlap is considered.</p>
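      <p>For completeness, a minimal sketch of the set-overlap precision, recall, and F1 used for list questions (again an illustration rather than the official evaluation script):</p>
      <preformat>
# Set-overlap precision/recall/F1 for list-type answers (illustrative only).
def list_prf1(gold, predicted):
    gold_set = {g.lower() for g in gold}
    pred_set = {p.lower() for p in predicted}
    overlap = len(gold_set.intersection(pred_set))
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(list_prf1(["fever", "rash", "arthralgia"], ["fever", "headache"]))  # (0.5, 0.333..., 0.4)
      </preformat>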
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>
          We utilized the official BioASQ datasets [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ] released as part of the BioASQ Challenge. In total, 5,389
biomedical questions were provided by the challenge organizers, categorized into three types: yes/no,
factoid, and list. Each question is accompanied by supporting snippets from PubMed abstracts.
        </p>
        <p>For evaluation, we constructed two internal test subsets based on the official BioASQ 11b and 12b test
sets. The BioASQ 11b test data contains 86 yes/no, 98 factoid, and 66 list questions, with an average of
9.45, 7.39, and 15.15 snippets per question, respectively. The second test set is derived from the BioASQ
12b test set and follows the same annotation format. See Table 1 for detailed statistics.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model</title>
        <p>
          To address BioASQ Task B, which focuses on biomedical question answering across multiple formats,
including yes/no, factoid, and list-type questions, we employed a suite of state-of-the-art large language
models (LLMs): GPT-4-0125-preview, GPT-4o-mini, GPT-4o (developed by OpenAI), and Claude
(developed by Anthropic). These models were chosen for their strong reasoning capabilities, broad biomedical
knowledge, and robust performance in open-domain QA tasks. Our model selection was motivated in
part by the success of previous BioASQ challenge participants [
          <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
          ], many of whom integrated
GPT-based models—such as GPT-3.5 and early GPT-4 variants—into their systems and achieved
competitive results. These precedents demonstrated the effectiveness of large-scale LLMs in biomedical
QA settings and provided a practical foundation for adopting more advanced models in our pipeline.
Additionally, we prioritized the GPT-4 series over GPT-3.5 due to their support for structured outputs,
such as generating answers in JSON format. This feature is especially valuable in BioASQ, where
answers must adhere to specific schemas depending on question type (e.g., boolean for yes/no, single
entities for factoid, or lists for list questions). Relying on models that directly produce structured
responses allowed us to reduce post-processing overhead and minimize potential formatting errors,
thereby improving overall pipeline robustness and evaluation compatibility.
        </p>
        <p>Table 2 (prompt templates, condensed). All three prompts share the same core instructions: base the
answer ONLY on the information provided in the [Snippets] section, ALWAYS prioritize the snippet content
over internal knowledge in case of conflict, and provide the final answer solely based on the snippet content.
The yes/no prompt additionally requires a lowercase "yes" or "no", with "none" when the model is unsure.
The factoid prompt requests a JSON string array of 1–2 concise entity names or numbers, using a single
term when synonyms exist and returning an empty list if unknown. The list prompt requests a JSON string
array of up to 7–8 items under the same conventions. Each prompt ends with the [Snippets] and [Question]
fields, which are filled with the retrieved snippets and the question text.</p>
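        <p>To make the structured-output setup concrete, the sketch below shows one way such a query could be issued with the OpenAI Python SDK (v1.x); the model name, prompt wording, and JSON parsing here are illustrative assumptions rather than our exact production configuration.</p>
        <preformat>
# Hedged sketch: querying a chat model for a factoid question with a JSON-array answer.
import json
from openai import OpenAI  # OpenAI Python SDK v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "INSTRUCTIONS:\n"
    "1. Base your answer ONLY on the information provided in the [Snippets] section.\n"
    "2. If there is any conflict between the snippet information and your internal "
    "knowledge, ALWAYS prioritize the snippet.\n"
    "3. Return a JSON string array of concise entity names or numbers. "
    "Return an empty list if unknown.\n"
    "[Snippets]:\nFezolinetant is a neurokinin-3 (NK3) receptor antagonist.\n"
    "[Question]:\nWhat is the target of fezolinetant?"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",          # any of the GPT-4-series models used in this work
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
    logprobs=True,                # token log-probabilities, later used for ranking and ensembling
)

answer = json.loads(response.choices[0].message.content)  # e.g., ["NK3 receptor"]
print(answer)
        </preformat>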
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Prompt Design</title>
        <p>To accommodate the distinct requirements of the three question types in BioASQ Task B (yes/no, factoid,
and list), we designed custom prompts tailored to each type. These prompts enforced reliance on provided
snippets and required structured outputs where applicable. To ensure that the model reliably followed
these constraints, we carefully phrased and, in some cases, repeated key instructions within the prompts
to reinforce snippet-based answering behavior. The detailed formats for each prompt are provided in
Table 2.</p>
        <p>To improve the model’s ability to capture relevant entities, particularly when snippets are long or
contain dispersed information, we experimented with multiple snippet input configurations. We initially
designed multiple strategies not only to ensure coverage, but also to assess their relative effectiveness.
Our goal was to reduce the chance of missing key evidence by exploring different ways of presenting
the input. The following four strategies were implemented:
• Full Snippet (Default Order): All available snippets associated with a given question were
concatenated and provided to the model in the original order as retrieved from the dataset. This
setting reflects a realistic scenario where the model receives the complete context at once and is
expected to synthesize information across multiple evidence sentences.
• Random Ordered Full Snippet: All snippets were concatenated into a single input, but the
order of the snippets was randomly shuffled. This configuration was used to examine the model’s
robustness to non-sequential evidence and to assess whether reasoning performance depends on
snippet ordering.
• Single Snippet (One-by-One): Each snippet was presented to the model individually, paired
with the same question. The model generated an answer for each snippet independently, and
the final answer was determined by aggregating outputs (e.g., via majority vote for yes/no, or
top-confidence entity selection for factoid/list). This approach aimed to isolate the model’s ability
to extract relevant information from minimal context without being influenced by noisy or
conflicting snippets.
• No Snippet: The question was presented without any snippets, forcing the model to answer
based solely on internal knowledge.</p>
        <p>By applying all of these strategies across question types, we aimed to enhance answer coverage and
mitigate the risk of evidence omission.</p>
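        <p>The sketch below illustrates how these four input configurations could be assembled from a question and its snippet list; the template string and helper names are our own simplifications of the prompts in Table 2, not the exact code used in our runs.</p>
        <preformat>
# Illustrative assembly of the four snippet-input configurations (simplified template).
import random

TEMPLATE = (
    "INSTRUCTIONS: Base your answer ONLY on the [Snippets] section.\n"
    "[Snippets]:\n{snippets}\n[Question]:\n{question}"
)

def build_prompts(question, snippets, strategy, seed=0):
    """Return one prompt per model call for the given strategy."""
    if strategy == "full":            # Full Snippet (Default Order)
        return [TEMPLATE.format(snippets="\n".join(snippets), question=question)]
    if strategy == "random":          # Random Ordered Full Snippet
        shuffled = snippets[:]
        random.Random(seed).shuffle(shuffled)
        return [TEMPLATE.format(snippets="\n".join(shuffled), question=question)]
    if strategy == "one_by_one":      # Single Snippet (One-by-One): one call per snippet
        return [TEMPLATE.format(snippets=s, question=question) for s in snippets]
    if strategy == "no_snippet":      # No Snippet: parametric knowledge only
        return [TEMPLATE.format(snippets="(no snippets provided)", question=question)]
    raise ValueError(f"unknown strategy: {strategy}")
        </preformat>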
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Ensemble Strategy</title>
        <p>We performed ensembling across the outputs of multiple models, using task-specific strategies described
below:
• Yes/No Questions: For each yes/no question, we collected all responses generated by the models
and applied a majority voting scheme. Only responses labeled as "yes" or "no" were considered
valid votes; responses labeled as "none" were treated as abstentions and excluded from the
voting pool. The label with the highest number of votes among valid responses was selected as
the final answer. This approach ensured that the decision reflected the consensus among models
while discarding uncertain outputs.
• Factoid Questions: In the case of factoid questions, each model produced one or more entity
candidates along with associated log probabilities. We aggregated all unique entity candidates
across models and summed their log probabilities when the same entity appeared multiple
times. The final answer was determined by selecting the top-ranked entities according to these
cumulative scores. This ranking-based strategy allowed us to combine probabilistic confidence
across models and prioritize entities with consistent high-confidence support.
• List Questions: For list-type questions, we applied a frequency-based ensemble method. Each
model generated a list of entities as potential answers. We counted how many times each entity
appeared across all model outputs, regardless of position. Entities that exceeded a predefined
occurrence threshold were included in the final list. This threshold was chosen to balance precision
and recall, ensuring that only consistently predicted entities were retained while filtering out
spurious or low-confidence entries.</p>
        <p>These ensemble strategies were applied uniformly across all model outputs to improve robustness
and answer reliability. By leveraging complementary predictions from multiple models and aggregating
them in a task-aware manner, we aimed to reduce variance, correct isolated errors, and enhance overall
consistency. This approach allowed us to take advantage of both model diversity and redundancy,
ensuring that the final predictions reflected a more stable consensus across different models and decoding
conditions.</p>
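        <p>A compact sketch of these three aggregation rules (majority vote, log-probability summation, and frequency thresholding) is given below; the fallback label and the occurrence threshold are illustrative placeholders rather than our exact tuned values.</p>
        <preformat>
# Illustrative task-aware aggregation over multiple model outputs.
from collections import Counter, defaultdict

def vote_yes_no(answers):
    """Majority vote over "yes"/"no"; "none" responses are treated as abstentions."""
    votes = Counter(a for a in answers if a in ("yes", "no"))
    return votes.most_common(1)[0][0] if votes else "no"  # placeholder fallback when all abstain

def rank_factoid(candidates, top_k=5):
    """candidates: (entity, logprob) pairs pooled from all models; repeated entities are summed."""
    scores = defaultdict(float)
    for entity, logprob in candidates:
        scores[entity.lower()] += logprob
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def ensemble_list(list_outputs, min_count=2):
    """Keep entities predicted by at least min_count models (illustrative threshold)."""
    counts = Counter(e.lower() for output in list_outputs for e in set(output))
    return [entity for entity, c in counts.items() if c >= min_count]
        </preformat>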
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Official Evaluation on BioASQ-13b</title>
        <p>Tables 3, 4, 5 show that our best-performing models achieved top scores in most batches across all
question types. In these tables, the "Dense Rank" column represents the system’s rank on the leaderboard
without skipping positions for ties (i.e., ranks increase sequentially even when scores are tied). The
"Description" column follows the format "Ensemble (k), Prompt Type", where k denotes the number of
model outputs combined, and Prompt Type indicates the prompting strategy used (e.g., One-by-One). In
particular, in the yes/no type, all our systems consistently demonstrated stable and strong performance
across all batches. Most systems were designed around the one-by-one strategy, in which answers
are predicted separately for each snippet. We also integrated variations such as generating responses
without snippets and with randomly selected snippets to diversify perspectives. To better handle uncertainty,
we applied rule-based fallback strategies triggered by indicators like the ratio of "none" responses or
imbalanced yes/no predictions. Based on these signals, answers were either substituted from other
systems or selected based on log-probability confidence. While many systems remained small-scale
ensembles of two models, some incorporated up to 12 different outputs. The best-performing models
effectively combined these techniques, reaching a macro F1 score of 1.0000 in both Batches 1 and 2.</p>
        <p>All systems except one were based on single-model configurations for the factoid task. We tested
different answer selection methods, such as log-probability scoring with rule-based filters, and a
Reciprocal Rank Fusion (RRF) strategy in one case. The systems also varied in how they processed input:
some relied on all available snippets, while others processed each one independently. No ensemble
was used, except in a single submission for Batch 4, which combined three different outputs using RRF
and yielded competitive performance. Despite focusing primarily on individual systems, we achieved
strong MRR scores, consistently ranking among the top in several batches, including two second-place
finishes in Batch 1 and Batch 4.</p>
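        <p>Since one Batch 4 submission combined outputs with Reciprocal Rank Fusion, a generic RRF sketch is shown below; the constant k = 60 is the commonly used default, not necessarily the value used in our submission.</p>
        <preformat>
# Generic Reciprocal Rank Fusion over several ranked candidate lists (illustrative).
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """ranked_lists: iterable of ranked answer lists; returns a fused ranking."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, answer in enumerate(ranking, start=1):
            scores[answer.lower()] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse three systems' factoid candidate lists.
print(reciprocal_rank_fusion([["nk3 receptor", "nk1 receptor"],
                              ["nk3 receptor"],
                              ["neurokinin-3 receptor", "nk3 receptor"]]))
        </preformat>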
        <p>For the list-type task, we generated 16 candidate outputs by applying four distinct prompting
strategies to four different LLMs. Most of our systems were ensemble models that combined a subset
of these outputs. In Batches 1 and 3, we used fixed combinations of prompting strategies such as
full snippet, random ordered full snippet, and one-by-one prompting. Batch 2 employed a weighted
ensemble over combinations, where the models and their weights were determined based on F1 scores
from internal validation across all possible 4-model combinations among the 16 candidates. In Batch 4,
we selected four models from the full candidate pool and combined them using equal weights. Overall,
performance improvements across batches were primarily driven by effective model selection and the
diversity of prompting strategies, rather than the size of the ensemble.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluation on 11b and 12b</title>
        <p>We conducted an ablation study using the GPT-4o-mini model to evaluate how different prompt
design strategies affect biomedical question answering performance. As summarized in Table 6, we
compared the following four configurations:
• Full Snippet (Default Order): The full set of retrieved evidence snippets is concatenated in
their original order and provided in a single prompt.
• Random Ordered Full Snippet: Snippets are randomly reordered and queried multiple times
to mitigate positional bias and enhance evidence diversity.
• Single Snippet (One-by-One): Each snippet is paired individually with the question. The model
produces separate outputs, which are aggregated to form the final answer. In some cases, GPT-4
was used as the decoder in this configuration.
• No Snippet: The model answers the question without access to any supporting snippets, relying
solely on its parametric knowledge.
Yes/No. The random ordered full snippet setting yielded the highest macro-F1 score (0.9795),
demonstrating that randomized snippet ordering can improve robustness. The full snippet setting achieved a
macro-F1 of 0.8844, suggesting that a holistic view of the full snippet set provides consistent performance.
Factoid. The one-by-one setting consistently outperformed all others, achieving the best strict
accuracy, lenient accuracy, and MRR (0.7623, 0.8398, and 0.7941, respectively). These results indicate that
isolating snippets can help the model better extract concise factual entities, especially when they are
sparsely distributed across evidence.</p>
        <p>List. Performance on list-type QA revealed a trade-off between coverage and precision. The full
snippet format achieved the highest F1 score (0.6675) in the high-resource setup, likely due to its ability
to synthesize entity mentions across snippets. In contrast, the one-by-one strategy showed reduced
recall, possibly due to fragmented evidence limiting answer aggregation.</p>
        <p>No Snippet Setting. The no snippet baseline consistently underperformed across all question types.
This confirms the importance of retrieved evidence, particularly for factoid and list questions where
external grounding is essential.</p>
        <p>These findings highlight the crucial role of prompt design in biomedical QA. While one-by-one
prompting is particularly effective for entity-centric questions like factoids, the original format offers
more balanced performance for yes/no and list questions—especially under constrained conditions.
Randomized snippet ordering also improves model robustness, underscoring the benefit of prompt-level
perturbations in multi-evidence settings.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this study, we developed a series of systems for the BioASQ-13b Challenge Phase B, targeting yes/no,
factoid, and list question types. Our analysis revealed that the effectiveness of prompting strategies
varied across tasks. Yes/no questions, framed as binary classification problems, benefited from
full-context prompts that enabled holistic reasoning over all provided snippets. In contrast, factoid questions,
which require precise entity-level retrieval, performed better with one-by-one prompting, where each
snippet is processed independently to ensure comprehensive coverage of fine-grained information. This
approach was particularly useful when handling long or noisy snippet sets, where key evidence might
otherwise be overlooked.</p>
      <p>Guided by these insights, we applied task-specific ensemble strategies. For the factoid task, we
used a log-probability-based ensemble: each model produced a ranked list of entity candidates with
associated log-probabilities. We aggregated all unique entity candidates across models and summed
their log-probabilities when the same entity appeared multiple times. The final answer was selected by
ranking entities based on these cumulative scores, effectively capturing probabilistic consensus across
models. For list-type questions, we employed a frequency-based ensemble: each model generated a
list of candidate entities, and we counted the number of times each entity appeared across outputs,
regardless of position. Entities that exceeded a predefined occurrence threshold were retained in the
final answer. This method helped balance precision and recall, promoting entities with consistent
support while filtering out low-confidence or spurious predictions.</p>
      <p>Overall, our results demonstrate that task-aware prompting combined with lightweight ensemble
techniques can effectively enhance system performance. This strategy offers a practical and interpretable
framework for biomedical question answering with large language models.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Funding</title>
      <p>This work was supported by the National Research Foundation of Korea (NRF) grant funded by the
Korea government (MSIT) (NRF-2023R1A2C3004176), the Korea Health Industry Development Institute
(KHIDI) grant funded by the Ministry of Health &amp; Welfare, Republic of Korea (HR20C002103), and the
Institute of Information &amp; Communications Technology Planning &amp; Evaluation (IITP) grant funded by
the Korean government (MSIT) (IITP-2025-RS-2020-II201819). This research was also supported by the
RS-2023-00262002 grant.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Declaration on Generative AI</title>
      <p>Generative AI tools were used for writing assistance, including grammar correction and sentence
rephrasing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsatsaronis</surname>
          </string-name>
          , G. Balikas,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          , I. Partalas,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zschunke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Alvers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Polychronopoulos</surname>
          </string-name>
          , et al.,
          <article-title>An overview of the bioasq large-scale biomedical semantic indexing and question answering competition</article-title>
          ,
          <source>BMC bioinformatics 16</source>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. N. Maria</given-names>
            <surname>Di</surname>
          </string-name>
          <string-name>
            <surname>Nunzio</surname>
          </string-name>
          , Giorgio,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <source>Overview of BioASQ</source>
          <year>2025</year>
          :
          <article-title>The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in: J. Carrillo-de Albornoz, J. Gonzalo, et al. (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          , E. Tutubalina, G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          , et al.,
          <article-title>BioASQ at CLEF2025: The thirteenth edition of the large-scale biomedical semantic indexing and question answering challenge</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>407</fpage>
          -
          <lpage>415</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhingra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Pubmedqa: A dataset for biomedical research question answering</article-title>
          ,
          <source>arXiv preprint arXiv:1909.06146</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <source>Gpt-4 technical report, arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Anthropic</surname>
          </string-name>
          ,
          <source>The Claude 3 Model Family: Opus</source>
          , Sonnet, Haiku,
          <source>Technical Report, Anthropic</source>
          ,
          <year>2024</year>
          . URL: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, Model Card.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Paliouras</surname>
          </string-name>
          ,
          <article-title>Overview of BioASQ Tasks 13b and Synergy13 in CLEF2025</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lopyrev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>SQuAD: 100,000+ questions for machine comprehension of text</article-title>
          ,
          <source>arXiv preprint arXiv:1606.05250</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bougiatiotis</surname>
          </string-name>
          , G. Paliouras,
          <article-title>BioASQ-QA: A manually curated corpus for Biomedical Question Answering</article-title>
          ,
          <source>Scientific Data</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>170</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Mork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          G. Paliouras,
          <article-title>The road from manual to automatic semantic indexing of biomedical literature: a 10 years journey</article-title>
          ,
          <source>Frontiers in Research Metrics and Analytics</source>
          <volume>8</volume>
          (
          <year>2023</year>
          )
          <fpage>1250930</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Merker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bondarenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Viehweger</surname>
          </string-name>
          , Mibi at bioasq 2024:
          <article-title>retrieval-augmented generation for answering biomedical questions</article-title>
          ,
          <source>in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2024</year>
          ), Grenoble, France, volume
          <volume>3740</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>176</fpage>
          -
          <lpage>187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.-C.</given-names>
            <surname>Chih</surname>
          </string-name>
          , J.-C. Han,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tzong-Han</surname>
          </string-name>
          <string-name>
            <surname>Tsai</surname>
          </string-name>
          ,
          <article-title>Ncu-iisr: enhancing biomedical question answering with gpt-4 and retrieval augmented generation in bioasq 12b phase b</article-title>
          ,
          <source>CLEF Working Notes</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>O.</given-names>
            <surname>Şerbetçi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. D.</given-names>
            <surname>Wang</surname>
          </string-name>
          , U. Leser,
          <article-title>Hu-wbi at bioasq12b phase a: Exploring rank fusion of dense retrievers and re-rankers</article-title>
          ,
          <source>in: Proceedings of the Conference and Labs of the Evaluation Forum</source>
          , Grenoble, France,
          <year>2024</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>