<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NCU-IISR: Biomedical Question Answering via Gemini and GPT APIs in the BioASQ 13b Phase B Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bing-Chen Chih</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jen-Chieh Han</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hsi-Chuan Hung</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Tzong-Han Tsai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Engineering, National Central University</institution>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Medical Research, Cathay General Hospital</institution>
          ,
          <addr-line>Taipei</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Research Center for Humanities and Social Sciences</institution>
          ,
          <addr-line>Academia Sinica</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we present our system and submissions for the BioASQ 13b Phase B task, continuing our efforts to improve biomedical question answering (QA) using large language models (LLMs). Building on prior work, we explored the integration of Google's Gemini API alongside OpenAI's Chat Completions API to compare and leverage the strengths of both models in the biomedical domain. Our system retains the use of Retrieval-Augmented Generation (RAG) techniques by employing file-based contextual search to retrieve relevant background documents, which are then incorporated into model prompts. We applied refined prompt engineering strategies tailored for factoid, list, and yes/no questions. Through comprehensive experiments with both LLM APIs, we identified optimal prompting patterns and response consolidation methods. Our final submission utilized a multi-model pipeline and achieved competitive results across multiple evaluation metrics, demonstrating the effectiveness of multi-model orchestration and document-grounded generation in biomedical QA.</p>
      </abstract>
      <kwd-group>
<kwd>Biomedical Question Answering</kwd>
        <kwd>Large Language Models (LLMs)</kwd>
        <kwd>Generative Pre-trained Transformer</kwd>
        <kwd>Gemini</kwd>
        <kwd>Retrieval Augmented Generation</kwd>
        <kwd>CEUR-WS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The BioASQ shared task [1] has been a leading benchmark for advancing biomedical semantic indexing
and question answering through its annual shared tasks since 2013. In its 13th iteration, Task 13b
Phase B challenges participants to generate both exact and ideal answers to biomedical questions using
provided textual snippets. The 2025 dataset [2] consists of 5,389 questions, comprising prior annotations
with gold-standard answers and 340 newly curated test questions. These are organized into four batches
of 85 questions each, crafted by domain experts. The task covers four question types: yes/no, factoid,
list, and summary. While all types require an ideal answer, only yes/no, factoid, and list questions
require exact answers. Participants are allowed up to five submissions per batch, facilitating iterative
refinement of their QA systems.</p>
      <p>Each instance in the BioASQ dataset includes a question, one or more relevant snippets, and
corresponding gold answers categorized into "ideal" and "exact" forms. Table 1 provides representative
examples for each question type. In previous work [3], we achieved competitive performance by
leveraging GPT-4’s language understanding capabilities in combination with tailored prompt engineering
strategies and Retrieval-Augmented Generation (RAG) techniques [4].</p>
<p>In 2025, we extend our approach by developing multiple independent systems, each based on a single large
language model. Specifically, we experiment with gpt-4o [5] and o3-mini [6] from OpenAI, as well as
gemini-2.0-flash [7] from Google. Each system is designed to operate separately, without combining
outputs from different models. All systems continue to utilize a RAG framework, employing
file-based document search to retrieve relevant biomedical content, which is incorporated into prompt
construction to enhance factual grounding. Prompt engineering techniques are further refined for each
question type to match the strengths and limitations of the underlying model. This setup allows for a
controlled comparison of LLMs under consistent conditions in the BioASQ 13b Phase B task.
</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Representative examples for each question type.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Question</th>
              <th>Exact answer</th>
              <th>Ideal answer</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td />
              <td />
              <td>Amiloride reduces arrhythmogenicity through the modulation of KCNQ1 splicing. Therefore, the modulation of KCNQ1 splicing may help prevent arrhythmias.</td>
            </tr>
            <tr>
              <td>Which drugs are included in the AZD7442?</td>
              <td>tixagevimab, cilgavimab</td>
              <td>AZD7442 is a combination of two long-acting monoclonal antibodies tixagevimab and cilgavimab. It has been authorized for the prevention and treatment of coronavirus disease 2019 (COVID-19).</td>
            </tr>
            <tr>
              <td>Olokizumab is tested for which disease?</td>
              <td>rheumatoid arthritis</td>
              <td>Olokizumab, a monoclonal antibody against interleukin 6, improves outcomes of rheumatoid arthritis.</td>
            </tr>
            <tr>
              <td>What is the definition of dermatillomania?</td>
              <td />
              <td>Dermatillomania is a condition that leads to repetitive picking of their skin ending up in skin and soft tissue damage. It is a chronic, recurrent, and treatment resistant neuropsychiatric disorder with an underestimated prevalence that has a concerning negative impact on an individual’s health and quality of life.</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The biomedical domain presents significant challenges for information access due to its vast amount of
domain-specific knowledge and complex terminology. Traditional information acquisition methods,
such as manually reading large volumes of academic literature, are time-consuming and demand high
professional expertise. This process often proves inefficient, particularly when medical professionals
and the general public require rapid access to accurate biomedical information.</p>
<p>To address this inefficiency, Question Answering (QA) systems based on Natural Language Processing
(NLP) have gained increasing attention. By leveraging large-scale language models, these systems
can effectively interpret questions, retrieve relevant biomedical information, and generate accurate
responses. This paradigm significantly improves the accessibility and usability of biomedical knowledge,
bridging the gap between complex textual data and practical applications. With ongoing advances
in deep learning, the performance of LLM-based QA systems continues to improve, enabling more
effective support for biomedical research and clinical decision-making.</p>
      <p>Prompt Engineering has become a pivotal strategy in optimizing the performance of large language
models such as GPT, LLaMA, and their successors. It involves the deliberate design of input prompts to
elicit accurate and contextually appropriate outputs from pretrained models. Research has shown that
carefully crafted prompts can significantly enhance model performance, particularly in few-shot [8]
or zero-shot settings. For example, Brown et al. [8] demonstrated that prompt design substantially
improved LLM accuracy in various NLP tasks with limited examples. This approach is now widely
adopted across domains, including biomedical QA, summarization, and machine translation, where
precise model behavior is critical.</p>
      <p>Retrieval Augmented Generation integrates retrieval mechanisms with generative models to
improve the factual accuracy and relevance of generated content. First introduced by Lewis et al. [4],
RAG frameworks retrieve external documents relevant to a query and incorporate them into the
generative process, guiding the model toward more informed and grounded outputs. This method has
proven especially effective in open-domain and biomedical question answering, where domain-specific
knowledge is crucial. Recent applications of RAG continue to demonstrate its advantages in enhancing
response quality, especially when coupled with high-quality document retrieval systems and robust
prompt integration techniques.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
<p>We adopt a RAG framework to enhance answer quality in biomedical question answering. Traditionally,
RAG consists of two components: a retriever, which identifies relevant documents based on the input
question, and a generator, which uses those documents to generate responses. In prior work, we
implemented a local retrieval pipeline using Dense Passage Retrieval (DPR), encoding both queries and
documents into dense vectors for similarity-based matching.</p>
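<p>The DPR-style retriever from our prior pipeline can be illustrated with a minimal sketch. The encoder is stubbed with toy vectors; in the real pipeline a trained dense encoder would produce the embeddings, so the vectors and document names below are purely illustrative.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve(query_vec, doc_vecs, top_k=2):
    """Rank documents by similarity to the query vector and keep the top_k."""
    scored = sorted(doc_vecs.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]

# Toy embeddings standing in for DPR encoder outputs.
docs = {
    "snippet_a": [0.9, 0.1, 0.0],
    "snippet_b": [0.1, 0.9, 0.2],
    "snippet_c": [0.8, 0.2, 0.1],
}
print(retrieve([1.0, 0.0, 0.0], docs))  # → ['snippet_a', 'snippet_c']
```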
      <p>This year, we simplify the retrieval process by utilizing the file search functionality provided by
the respective LLM platforms, which effectively abstracts the retriever component while maintaining
comparable retrieval relevance. Retrieved snippets are automatically appended to the prompt context,
supporting accurate and grounded generation. This shift allows us to focus more on model behavior
and prompt optimization without the overhead of maintaining a separate retriever infrastructure.</p>
      <p>Each system in our setup employs only a single LLM. We constructed independent systems using
OpenAI’s gpt-4o, o3-mini, and Google’s gemini-2.0-flash. We standardized the prompt structure, retrieval
strategy, and output formatting across all systems.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>We used the BioASQ Task 13b Phase B dataset [2], which consists of 5,389 training samples derived from
previous BioASQ tasks and newly added questions. The dataset includes four question types: summary
(1,283), factoid (1,600), list (1,047), and yes/no (1,459). Each question is paired with multiple snippets
sourced from biomedical documents.</p>
<p>In contrast to our previous work, where only the top five snippets were retained due to token
limitations, this year we incorporated all available snippets per question. Advances in LLM context
length and API throughput enabled this expansion; we found that including all snippets enhanced
model performance and led to more comprehensive answer generation, without degrading latency or
fluency.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Prompt</title>
<p>Prompt Construction. We designed our prompting strategy based on two system variants:</p>
        <list list-type="bullet">
          <list-item>
            <p>Systems with file search: For models that support file-based retrieval (e.g., gpt-4o using file search), the prompt contains only the question itself. Contextual information is automatically injected by the system based on the indexed document snippets.</p>
          </list-item>
          <list-item>
            <p>Systems without file search: In configurations that do not utilize file search (e.g., o3-mini and gemini-2.0-flash), we manually insert all relevant snippets directly into the prompt, followed by the question. This allows the model to process contextual evidence inline without access to an external retriever.</p>
          </list-item>
        </list>
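<p>The two variants differ only in how context reaches the model. A minimal sketch of the prompt assembly, where the function name and the template wording are illustrative rather than our exact production prompts:</p>

```python
def build_prompt(question, snippets=None):
    """Assemble the user prompt for the two system variants.

    With file search (snippets is None), the prompt is the question alone;
    the platform injects retrieved context itself. Without file search,
    every relevant snippet is inlined ahead of the question.
    """
    if snippets is None:
        return question
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Reference snippets:\n{context}\n\nThe question is: {question}"

question = "Olokizumab is tested for which disease?"
with_file_search = build_prompt(question)  # question only; context injected by the platform
without_file_search = build_prompt(
    question,
    ["Olokizumab, a monoclonal antibody against interleukin 6, improves outcomes of rheumatoid arthritis."],
)
```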
        <p>Answer Generation. Each system is instructed to return both the ideal and exact answers in JSON
format. In most configurations, both answers are generated in a single step. In an alternative two-stage
pipeline, we first prompt the model to generate the ideal answer, followed by generating the exact
answer using the ideal response as intermediate context. This strategy is motivated by the observation
that entities in the exact answer often co-occur in the ideal response. To improve the accuracy of exact
answers, we use few-shot examples for each question type.</p>
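<p>The two-stage pipeline can be sketched as follows, with the model call stubbed out. The <monospace>call_llm</monospace> helper and the JSON field names (<monospace>ideal_answer</monospace>, <monospace>exact_answer</monospace>) are illustrative assumptions, not the exact API contract of our systems.</p>

```python
import json

def call_llm(prompt):
    """Stub standing in for a gpt-4o / o3-mini / gemini-2.0-flash API call;
    a real system would send the prompt to the model and return its raw text."""
    if "exact answer" in prompt:
        return json.dumps({"exact_answer": ["rheumatoid arthritis"]})
    return json.dumps({"ideal_answer": "Olokizumab improves outcomes of rheumatoid arthritis."})

def two_stage_answer(question):
    # Stage 1: generate the ideal answer from the question and retrieved context.
    stage1 = json.loads(call_llm(f"Return JSON with an ideal answer. Question: {question}"))
    ideal = stage1["ideal_answer"]
    # Stage 2: extract the exact answer with the ideal answer as intermediate context,
    # since entities in the exact answer often co-occur in the ideal response.
    stage2 = json.loads(call_llm(f"Return JSON with an exact answer. Context: {ideal} Question: {question}"))
    return {"ideal_answer": ideal, "exact_answer": stage2["exact_answer"]}

result = two_stage_answer("Olokizumab is tested for which disease?")
```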
<p>Adaptation in Ideal Answer. For ideal answers, we observed a strong correlation between the
gold ideal answers and snippet phrasing. Accordingly, we modified prompts to encourage the model to reuse
snippet segments verbatim in the ideal answer. This adjustment led to improved fidelity and alignment
with gold-standard references.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Strategy</title>
        <p>Our approach employs two primary strategies for answer generation: (1) direct generation of both
the ideal and exact answers in a single step, and (2) sequential generation in a two-stage format,
where the ideal answer is first generated and subsequently used to guide the extraction of the exact
answer. This design is conceptually aligned with the chain-of-thought paradigm [9], where intermediate
reasoning enhances output precision. While two-stage prompting can improve structure and
consistency—particularly for factoid and list questions—our experiments showed that single-stage prompting
is often sufficient and more efficient in practice.</p>
<p>To enhance the factual grounding of generated answers, we integrated a RAG [4] approach by
leveraging the file search functionality provided by the model platforms. This replaces the need for
a custom embedding-based retriever. In systems with file search support (e.g., gpt-4o), relevant
documents are uploaded and indexed beforehand; the model automatically incorporates pertinent
content during inference. In contrast, for systems without file search capability (e.g., o3-mini and
gemini-2.0-flash), we manually embed all relevant snippets into the prompt context.</p>
<p>Our dataset analysis revealed that ideal answers frequently include verbatim segments from the
supporting snippets. To exploit this pattern, we explicitly adjusted prompts to encourage snippet
reuse in the generation process. This adjustment led to noticeable gains in factual accuracy and
alignment with gold-standard answers. We also investigated the impact of decoding hyperparameters.
Temperature settings were tuned across systems, with temperature = 0 generally yielding more
deterministic and concise responses. However, slight temperature increases in some cases improved
fluency or prevented overly rigid outputs.</p>
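<p>In practice, decoding control reduces to a few request parameters per system. The values below are placeholders for the kind of settings we tuned, not our production configuration:</p>

```python
# Illustrative per-system decoding settings (placeholder values, not the
# tuned production configuration). temperature = 0 favours deterministic,
# concise responses; a small positive value can loosen overly rigid output.
DECODING = {
    "gpt-4o": {"temperature": 0.0},
    "o3-mini": {"temperature": 0.0},
    "gemini-2.0-flash": {"temperature": 0.2},
}

def decoding_params(model_name):
    """Return the decoding parameters for a system, defaulting to temperature 0."""
    return DECODING.get(model_name, {"temperature": 0.0})
```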
        <p>Overall, the combination of prompt refinement, strategic use of file search, and decoding control
enabled our systems to produce contextually accurate and well-structured answers for the BioASQ 13b
Phase B challenge.</p>
<p>Table 2 provides representative examples of our prompt templates for different stages and question
types.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Systems</title>
<p>We employed different system configurations across batches to evaluate the performance of various
language models and prompting strategies. Each system is based on a single model and follows the
RAG-based prompting architecture described in previous sections. The detailed model assignments,
prompting strategies, and use of file search for each batch are summarized in Table 3.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Result and Analysis</title>
<p>Our evaluation results are summarized in Table 4 (Exact Answer) and Table 5 (Ideal Answer). Our system
did not demonstrate consistently strong performance in any of the four evaluated batches.
However, we observed relatively higher results in specific question types—namely summary,
yes/no, and list questions—where several metrics showed stable and competitive performance.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption>
          <p>Representative prompt templates by question type.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Question type</th>
              <th>Prompt template</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Summary (ideal answer)</td>
              <td>Reply to the answer clearly and easily in less than 3 sentences. You should read the chat history’s content before answer the question. You can directly copy part of the above snippets as part of your answer. The question is: ""</td>
            </tr>
            <tr>
              <td>Yes/no</td>
              <td>Please answer me only "yes" or "no". You should read the reference’s content before answer the question. The question is:</td>
            </tr>
            <tr>
              <td>List</td>
              <td>Please answer me and follow the following rules: 1. Give me a list of precise key entities to answer the question, as clear and concise as possible. 2. The list should contain entity names, jointly taken to constitute a single answer. 3. You should read the reference’s content and chat history before answer the question. The Question is: ""</td>
            </tr>
            <tr>
              <td>Factoid</td>
              <td>Please answer me and follow the following rules: 1. Give me a list of precise key entities to answer the question, as clear and concise as possible. 2. The list should contain up to 5 entity names, ordered by decreasing confidence. 3. You should read the reference’s content and chat history before answer the question. The Question is: ""</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Conversely, our systems consistently underperformed on factoid questions, regardless of the model
or generation strategy applied. This suggests that further refinement is needed for entity-level answer
extraction.</p>
      <p>Among the various system configurations, those incorporating file-based retrieval (file search)
demonstrated superior performance compared to those relying solely on in-prompt snippets. The added
contextual grounding provided by file search enhances the model’s ability to produce accurate and
relevant responses, particularly for complex or abstract biomedical queries.</p>
      <sec id="sec-4-1">
        <title>4.1. Key Findings and Cross-Model Behaviour</title>
<p>Across the four evaluation batches, gpt-4o with file search (IISR-2, -4, -5) achieved the most consistent
strength on yes/no and list questions, reaching up to 0.94 accuracy for yes/no (Batch-2) and 0.64 F1 for
list answers (Batch-3). These gains align with the intuition that explicit retrieval grounding mitigates
hallucination and enables concise binary or enumerative responses. By contrast, o3-mini variants
without retrieval (IISR-1, -3 in later batches) showed competitive yes/no accuracy but declined on
list F1—reflecting limited token context and weaker factual grounding.</p>
<p>Factoid performance remained volatile for all models. We observed two recurring failure modes:</p>
        <list list-type="bullet">
          <list-item>
            <p>Entity granularity mismatch (e.g., returning a drug class instead of a molecule name).</p>
          </list-item>
          <list-item>
            <p>Partial answer coverage, where the LLM stopped after one entity despite multiple correct options.</p>
          </list-item>
        </list>
        <p>File search acts as a high-recall retriever that appends all relevant sections without occupying prompt
tokens. Nevertheless, retrieval noise occasionally introduces spurious entities, which adversely affect
factoid precision, underscoring the need for efective post-filtering.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Metric Anomaly Clarification</title>
<p>Across multiple batches (notably Batches 2 and 4) on factoid questions, some systems produced identical
aggregate scores of SAcc = LAcc = MRR = 0.5000. This behaviour stems from two design choices and
the way BioASQ computes factoid metrics:</p>
        <list list-type="order">
          <list-item>
            <p>Single-entity output. These systems emit exactly one entity per factoid question. In cases where the gold standard lists several synonyms (e.g., “tamoxifen”, “TAM”) or multiple distinct answers, our system proposes only the top candidate.</p>
          </list-item>
          <list-item>
            <p>Metric definitions. For each question, BioASQ evaluates:</p>
            <list list-type="bullet">
              <list-item>
                <p>SAcc – strict accuracy, 1 if the first returned entity matches the gold list, 0 otherwise.</p>
              </list-item>
              <list-item>
                <p>LAcc – lenient accuracy, 1 if any returned entity matches the gold list; identical to SAcc when only one entity is returned.</p>
              </list-item>
              <list-item>
                <p>MRR – reciprocal rank, equal to 1/r where r is the rank of the first correct entity (thus 1 when the sole answer is correct, 0 otherwise).</p>
              </list-item>
            </list>
            <p>Because these systems return a single item, these three per-question values are numerically identical.</p>
          </list-item>
        </list>
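<p>The collapse of the three scores can be reproduced with a small sketch of the per-question factoid metrics. This reflects our reading of the BioASQ measure definitions; the official evaluation scripts remain authoritative.</p>

```python
def factoid_metrics(returned, gold):
    """Per-question factoid scores for a ranked list of returned entities.

    gold is the set of acceptable lowercase strings (synonyms included).
    """
    hits = [rank for rank, e in enumerate(returned, start=1) if e.lower() in gold]
    sacc = 1.0 if returned and returned[0].lower() in gold else 0.0  # first entity correct
    lacc = 1.0 if hits else 0.0                                      # any returned entity correct
    mrr = 1.0 / hits[0] if hits else 0.0                             # reciprocal rank of first hit
    return sacc, lacc, mrr

gold = {"tamoxifen", "tam"}
# A single-entity system collapses all three per-question values:
correct = factoid_metrics(["Tamoxifen"], gold)   # (1.0, 1.0, 1.0)
wrong = factoid_metrics(["raloxifene"], gold)    # (0.0, 0.0, 0.0)
# Averaged over one correct and one wrong question, SAcc = LAcc = MRR = 0.5.
```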
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this year’s participation, we explored multiple language model configurations and prompting
strategies under a unified RAG-based question answering framework. While overall performance across the
four batches did not surpass our previous year’s results, we identified several areas of relative strength
and weakness.</p>
<p>Our systems performed better on summary, yes/no, and list questions, indicating that the models were
more effective at generating abstracted or binary content. On the other hand, factoid questions proved
challenging across all models and configurations, suggesting limitations in precise entity extraction
under current prompting schemes.</p>
      <p>The use of file search for retrieval consistently contributed to improved results. This mechanism
allowed for more flexible and scalable integration of contextual knowledge, and its impact was evident
in higher scores for supported models. By contrast, systems that embedded all snippets directly into
the prompt faced token and relevance limitations that may have affected accuracy.</p>
<p>Our findings suggest that better handling of factoid-style questions—potentially through
entity-aware prompts or post-processing—could be a critical direction for future improvement. Additionally,
more adaptive retrieval methods and refined prompt segmentation strategies may further enhance the
precision and reliability of biomedical QA systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
<p>During the preparation of this work, the author used ChatGPT (GPT-4o) in order to: assist in drafting,
perform grammar and style correction, and polish the text. After using this tool/service, the author
reviewed and edited the content as needed and takes full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[2] A. Krithara, A. Nentidis, K. Bougiatiotis, G. Paliouras, BioASQ-QA: A manually curated corpus for biomedical question answering, Scientific Data 10 (2023) 170.</p>
      <p>[3] B.-C. Chih, J.-C. Han, R. T.-H. Tsai, NCU-IISR: Enhancing biomedical question answering with GPT-4 and retrieval augmented generation in BioASQ 12b Phase B, CEUR Workshop Proceedings 3740 (2024) 99–105.</p>
      <p>[4] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-augmented generation for knowledge-intensive NLP tasks, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 9459–9474. URL: https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html.</p>
      <p>[5] OpenAI, GPT-4o system card, 2024. URL: https://arxiv.org/abs/2410.21276. arXiv:2410.21276.</p>
      <p>[6] OpenAI, OpenAI o3 and o4-mini System Card, Technical Report, OpenAI, 2025. URL: https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf, accessed: 2025-05-22.</p>
      <p>[7] Gemini Team, R. Anil, S. Borgeaud, J.-B. Alayrac, et al., Gemini: A family of highly capable multimodal models, 2025. URL: https://arxiv.org/abs/2312.11805. arXiv:2312.11805.</p>
      <p>[8] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165 (2020).</p>
      <p>[9] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, 2023. arXiv:2201.11903.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Nentidis, et al., BioASQ at CLEF2025: The thirteenth edition of the large-scale biomedical semantic indexing and question answering challenge, in: Advances in Information Retrieval. ECIR 2025. Lecture Notes in Computer Science, volume 15576, Springer, Cham, 2025. URL: https://doi.org/10.1007/978-3-031-88720-8_61. doi:10.1007/978-3-031-88720-8_61.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>