<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LLM Ensemble for RAG: Role of Context Length in Zero-Shot Question Answering for BioASQ Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dima Galat</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego Molla-Aliod</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Macquarie University</institution>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Technology Sydney (UTS)</institution>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Biomedical question answering (QA) poses significant challenges due to the need for precise interpretation of specialized knowledge drawn from a vast, complex, and rapidly evolving corpus. In this work, we explore how large language models (LLMs) can be used for information retrieval (IR), and how an ensemble of zero-shot models can achieve state-of-the-art performance on a domain-specific Yes/No QA task. Evaluating our approach on the BioASQ challenge tasks, we show that ensembles can outperform individual LLMs and in some cases rival or surpass domain-tuned systems, all while preserving generalizability and avoiding the need for costly fine-tuning or labeled data. Our method aggregates outputs from multiple LLM variants, including models from Anthropic and Google, to synthesize more accurate and robust answers. Moreover, our investigation highlights a relationship between context length and performance: while expanded contexts are meant to provide valuable evidence, they simultaneously risk information dilution and model disorientation. These findings emphasize IR as a critical foundation in Retrieval-Augmented Generation (RAG) approaches for biomedical QA systems. Precise, focused retrieval remains essential for ensuring LLMs operate within relevant information boundaries when generating answers from retrieved documents. Our results establish that ensemble-based zero-shot approaches, when paired with effective RAG pipelines, constitute a practical and scalable alternative to domain-tuned systems for biomedical question answering.</p>
      </abstract>
      <kwd-group>
        <kwd>large language model</kwd>
        <kwd>biomedical question answering</kwd>
        <kwd>information retrieval</kwd>
        <kwd>natural language processing</kwd>
        <kwd>BioASQ</kwd>
        <kwd>CEUR-WS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Biomedical Question Answering (QA) is a challenging task, and BioASQ is an annual competition
aimed at fostering the development of intelligent systems specialized in Information Retrieval (IR)
and QA within the biomedical domain [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The competition consists of three distinct phases: phase A,
which focuses on a biomedical IR task; phase B, which centers on QA and summarization tasks; and phase
A+, which seeks to develop an end-to-end approach combining both phases and was
introduced for the first time in 2024 to encourage further development in this area [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Notably, a
third of the participating teams attempted all three phases of the challenge that year.
This paper presents the following key advancements:
• We develop a zero-shot QA ensembling framework that leverages Large Language Models (LLMs)
and answer synthesis to achieve state-of-the-art results on a Yes/No QA task.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>IR and QA are two well-established fields in Natural Language Processing (NLP), and both
have been the subject of extensive research for decades. Most relevant for us is work on the use of LLMs for query
generation in IR, on answer re-ranking after a first pass by a traditional retrieval system, and on the use of
LLMs and RAG for QA.</p>
      <p>
        LLMs have been used to generate search queries, or to rewrite initial search queries to improve
their performance. Techniques include the use of zero-shot, few-shot and chain-of-thought prompting
[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4, 5</xref>
        ], supervised fine-tuning [6], and reinforcement learning [7].
      </p>
      <p>Search results have been reranked using a wide range of techniques, the most common being to apply
a similarity metric such as cosine similarity to the output of a cross-encoder or a twin model such as
Sentence-BERT [8]. Google released the ReFr open-source framework for re-ranking search results,
which allows researchers to explore multiple features and learning methods [9].</p>
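A minimal sketch of this kind of similarity-based reranking, assuming embeddings have already been computed (by Sentence-BERT or any other encoder); this is an illustration of the technique, not any system cited here:

```python
import numpy as np

def rerank_by_cosine(query_emb: np.ndarray, doc_embs: np.ndarray) -> list[int]:
    """Return document indices sorted by cosine similarity to the query."""
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores).tolist()

# Toy example: the second document points in the same direction as the query.
query = np.array([1.0, 0.0])
docs = np.array([[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]])
print(rerank_by_cosine(query, docs))  # [1, 2, 0]
```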
      <p>LLMs have also been used to generate answers to questions, either in a zero-shot, or few-shot manner,
or by fine-tuning on question-answering datasets. The integration of contextual information through
RAG consistently improves answer quality and factual accuracy compared to LLM approaches not
relying on additional relevant context information [10]. A quick look at the proceedings of BioASQ 12
at CLEF 2024 shows that, out of a total of 23 papers, 6 papers used the terms LLM or Large Language
Model in their title, and 3 additional papers used the terms RAG or Retrieval-Augmented Generation.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Our methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Information Retrieval Pipeline</title>
        <p>To support high-quality RAG for Phase A+, we developed an IR pipeline that integrates traditional
lexical search with LLM-based query generation and semantic reranking (Fig. 1).</p>
        <p>We index all PubMed article titles and abstracts in an Elasticsearch instance, using BM25 retrieval
as the ranking function. For each input question, we use Gemini 2.0 Flash to generate a structured
Elasticsearch query that captures the semantic intent of the question using synonyms, related terms,
and the full boolean query string syntax supported by Elasticsearch. The query is validated using
regular expressions and is then used to retrieve up to 10,000 documents.</p>
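As an illustration of the validation step, the sketch below shows a regex-based check one might apply to an LLM-generated boolean query string before submitting it; the validator and its rules are hypothetical stand-ins, not our exact implementation:

```python
import re

# Hypothetical validator for an LLM-generated Elasticsearch query_string
# query: checks balanced parentheses and quotes, rejects dangling operators.
def is_valid_query_string(q: str) -> bool:
    if q.count("(") != q.count(")") or q.count('"') % 2 != 0:
        return False
    if re.search(r'^\s*(AND|OR|NOT)\b|\b(AND|OR)\s*$', q):
        return False
    return bool(q.strip())

# A validated string would then be sent as a BM25 query, e.g. with
# elasticsearch-py:
#   es.search(index="pubmed", query={"query_string": {"query": q}}, size=10_000)

print(is_valid_query_string('("heart attack" OR "myocardial infarction") AND aspirin'))  # True
print(is_valid_query_string('aspirin AND'))  # False
```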
        <p>If the initial query returns fewer than five documents, we invoke Gemini 2.5 Pro Preview (05-06)
to automatically revise the query. The model is prompted to enhance retrieval recall by enabling
approximate matching and omitting overly rare or domain-specific terms. This refinement step is done
to improve the query coverage while maintaining relevance. Our experiments have shown that this
process is required in less than 5% of the queries in the BioASQ 13 test set.</p>
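The low-recall fallback can be sketched as follows, with `search` and `refine_query` as hypothetical stand-ins for the BM25 retrieval step and the Gemini 2.5 Pro revision prompt:

```python
# Sketch of the fallback protocol: if the strict query returns too few
# documents, revise it once for higher recall and retry.
MIN_DOCS = 5

def refine_query(query: str) -> str:
    # Stand-in for the LLM revision: enable approximate matching and
    # drop exact-phrase constraints (here, crudely, by removing quotes).
    return query.replace('"', "") + "~"

def search(query: str) -> list[str]:
    # Stand-in for BM25 retrieval; pretend the strict query matched nothing.
    return []

def retrieve_with_fallback(query: str) -> tuple[str, list[str]]:
    docs = search(query)
    if len(docs) < MIN_DOCS:
        query = refine_query(query)  # one automatic revision pass
        docs = search(query)
    return query, docs

q, docs = retrieve_with_fallback('"very rare phrase"')
print(q)  # the revised, approximate-match form of the query
```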
        <p>Following document retrieval, we apply a semantic reranking model (Google
semantic-ranker-default-004) to reduce the number of candidate documents [11]. This model re-scores the initially retrieved
documents based on semantic similarity to the original question, allowing us to select the top 300 most
relevant documents. This reranked subset is used for downstream RAG-based QA: despite the very
long contexts supported by modern Transformer architectures [12, 13], we could not obtain adequate QA
results on full article abstracts without this step.</p>
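The reranking stage can be sketched as follows; `score` is a toy lexical-overlap stand-in for the Google semantic ranker API, which we do not reproduce here:

```python
# Sketch of the reranking stage: re-score retrieved documents against
# the question and keep only the top k for downstream RAG.
def score(question: str, doc: str) -> float:
    # Toy relevance proxy: fraction of question words present in the doc.
    q_words = set(question.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

def top_k(question: str, docs: list[str], k: int = 300) -> list[str]:
    return sorted(docs, key=lambda d: score(question, d), reverse=True)[:k]

docs = ["aspirin prevents stroke", "unrelated abstract", "aspirin dosing in stroke care"]
print(top_k("does aspirin prevent stroke", docs, k=2))
```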
        <p>This multi-stage retrieval approach, combining LLM-generated queries, a traditional BM25 search,
and semantic reranking, enables flexible, high-recall, and high-precision document selection tailored to
complex biomedical queries.</p>
        <p>Finally, we added further IR searches to handle cases where the QA step does not return
a response based on the evidence retrieved from Elasticsearch. We observed that the Elasticsearch
context may not provide sufficient evidence for QA in 3-7% of test cases for Phase A+, depending
on the batch. An automated process expands the IR sources to address these cases. First, we use
a Google search restricted to PubMed sources to attempt to find new matches. If that fails, we
extend our sources to include the Office of Health Promotion and Disease Prevention, WebMD,
Healthline, and Wikipedia. This ensures that we have an answer candidate for all questions in Phase A+
test sets.</p>
        <p>[Figure 1: Information retrieval pipeline. The PubMed corpus in Elasticsearch is searched with an LLM-generated query (Gemini 2.0 Flash); the query is validated and run through BM25 retrieval (up to 10,000 docs); if fewer than 5 results are returned, the query is refined (Gemini 2.5 Pro) and retried; results are semantically reranked, and the top 300 articles are passed to RAG.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Question Answering Pipeline</title>
        <p>We adopt a unified, zero-shot QA framework for both Phase A+ and Phase B of the challenge. While the
core QA procedure remains consistent across phases, Phase A+ incorporates an additional IR step to
verify the presence of candidate answers within relevant documents (described at the end of Section 3.1).
This ensures that selected documents contain sufficient information to support answer generation.</p>
        <p>The system uses zero-shot prompting, tailored to the question type: Yes/No, Factoid, or List. We
experiment with multiple types of input context: (1) IR-derived results from Phase A+, (2) curated
snippets provided in Phase B, and (3) full abstracts of articles selected during Phase B. This allows us to
examine the influence of context granularity on answer accuracy and completeness.</p>
        <p>To generate candidate answers, we leverage several large language models (LLMs): Gemini 2.0 Flash,
Gemini 2.5 Flash Preview (2025-04-17), and Claude 3.7 Sonnet (2025-02-19). Prompts are adjusted using
examples derived from the BioASQ 11 test set, improving the response structure and quality.</p>
        <p>To consolidate candidate answers, we perform a secondary synthesis step using Gemini 2.0 Flash.
This model is prompted to resolve any contradictions, select the most precise and specific answer
components, and integrate complementary information into a single, unified response. As part of this
step, the model also returns a confidence score estimating the reliability of the synthesized answer. If
the score is below a predefined threshold (0.5, determined empirically), the synthesis is re-run with
reduced sampling temperature (from 0.1 to 0.0) to improve determinism. This synthesis process is
evaluated using the BioASQ 12 dataset to ensure consistency with benchmark standards.</p>
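The synthesis logic can be sketched as follows; `synthesize` is an illustrative stand-in (a simple majority vote) for the actual Gemini 2.0 Flash prompt, which returns both an answer and a confidence score:

```python
# Sketch of the consolidation step: synthesize candidate answers, and
# re-run deterministically when the reported confidence is low.
# All names here are illustrative stand-ins, not real APIs.
CONF_THRESHOLD = 0.5  # empirically chosen threshold from the paper

def synthesize(candidates: list[str], temperature: float) -> tuple[str, float]:
    # Toy stand-in: majority vote over yes/no candidates; confidence is
    # the vote share (a real system would prompt an LLM for both values).
    yes = sum(c == "yes" for c in candidates)
    answer = "yes" if yes * 2 >= len(candidates) else "no"
    conf = max(yes, len(candidates) - yes) / len(candidates)
    return answer, conf

def consolidated_answer(candidates: list[str]) -> str:
    answer, conf = synthesize(candidates, temperature=0.1)
    if conf < CONF_THRESHOLD:
        # Low confidence: retry with temperature 0.0 for determinism.
        answer, conf = synthesize(candidates, temperature=0.0)
    return answer

print(consolidated_answer(["yes", "yes", "no"]))  # yes
```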
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Phase A+</title>
        <p>BioASQ competition results are evaluated using accuracy, which measures the proportion of correctly
answered Yes/No questions, and MRR (Mean Reciprocal Rank), which assesses how early correct answers
appear in the ranked list of factoid question responses.</p>
        <p>Our IR pipeline was still under active development during the BioASQ 13 competition, and as a result, it
was not ready in time to submit the first two batches. Notably, on batch 4, our system achieved
state-of-the-art results on Yes/No questions, underscoring the effectiveness of the RAG approach described
in Section 3.1 (Table 1). Despite this success, our system encountered challenges in producing the
structured output required for List and Factoid questions, a consistent issue we have seen with
zero-shot generation. Table 2 shows the results for factoid questions. It is important to note that Simple
truncation is a system using Gemini 2.0 Flash as described in Figure 1, and Extractive is the system using
Claude 3.7 Sonnet, which worked better for long-context extraction despite technically having a smaller
available context size.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Phase B</title>
        <p>We evaluated our system on the BioASQ 12 dataset and observed competitive performance across all
batches. Overall, our system ranked 5th for exact answers across our experiments compared to systems
presented last year, with batch 4 posing the greatest challenge across all question types. Among the
models tested, Gemini consistently outperformed Claude, with Gemini 2.0 Flash showing significantly
better results than Gemini 2.5 Flash or Pro. The system showed its strongest performance on
Yes/No questions, while List-type questions were the most challenging. We found that using longer
contexts, such as full abstracts, generally hurts answer quality. This can be attributed to difficulties
in generating well-structured outputs for List and Factoid questions, further highlighting that answer
generation and formatting should be decoupled into separate stages to improve results on exact
questions requiring more nuanced responses than a binary answer.</p>
        <p>Across the BioASQ 13 Phase B batches, our system demonstrated consistently strong performance on
Yes/No questions. In batches 1, 2, and 3 we achieved state-of-the-art results on Yes/No questions using
snippets provided for this stage and an ensemble of Gemini 2.0 Flash and Gemini 2.5 Flash Preview
(2025-04-17) (Table 3). In batch 3, we refined our prompts to also achieve second place
on Factoid questions using the same model and context selection (Table 4). In batch 4, this approach
ranked third on Yes/No questions (Table 3), demonstrating the robustness of our synthesis pipeline
across varied input selection strategies.</p>
      </sec>
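The two official metrics can be sketched as follows (a straightforward reading of their definitions, not the official BioASQ scorer):

```python
# Accuracy for Yes/No questions, mean reciprocal rank (MRR) for factoids.
def accuracy(preds: list[str], gold: list[str]) -> float:
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def mrr(ranked_answers: list[list[str]], gold: list[str]) -> float:
    # For each question, score 1/rank of the first correct answer, else 0.
    total = 0.0
    for answers, g in zip(ranked_answers, gold):
        for rank, a in enumerate(answers, start=1):
            if a == g:
                total += 1.0 / rank
                break
    return total / len(gold)

print(accuracy(["yes", "no"], ["yes", "yes"]))         # 0.5
print(mrr([["b", "a"], ["x", "y", "a"]], ["a", "a"]))  # (1/2 + 1/3) / 2
```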
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <sec id="sec-5-1">
        <title>5.1. Conclusions</title>
        <p>
          This work demonstrates the effectiveness of integrating zero-shot LLMs with traditional IR systems
for providing an end-to-end approach to QA. LLMs can help bridge the semantic gap between queries
and relevant documents and are effective when used in a specialized domain [14]. Our findings align
with the growing trend, as evidenced by the substantial adoption of LLM and RAG techniques in recent
BioASQ competitions [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], as well as with studies that show that query rewriting provides substantial
performance advantages [
          <xref ref-type="bibr" rid="ref3 ref4">3, 5, 4</xref>
          ].
        </p>
        <p>The multi-stage IR pipeline provides robust performance while maintaining computational
efficiency. We have shown that a combination of LLM query generation, cross-encoder re-ranking,
and RAG is capable of processing very long domain-specific contexts, achieving state-of-the-art Yes/No
QA performance on multiple BioASQ batches this year. We found that producing structured data
outputs for other question types can be challenging, especially as the context size increases.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Future Work</title>
        <p>Several promising directions emerge from this research that warrant further investigation:</p>
        <p>Evaluation and Robustness: Developing more comprehensive evaluation frameworks that assess
not only accuracy but also consistency, bias, and hallucination rates across diverse query types and
domains [15]. ARES proposes an approach for evaluating RAG systems that fine-tunes lightweight LLM
judges using synthetic data and minimal human annotations, achieving high accuracy across tasks and
domains while outperforming prior methods [16].</p>
        <p>Interactive Systems: Development of conversational interfaces that can clarify ambiguous queries,
and provide explanations for retrieved information and generated answers [17]. Uncertainty detection
methods can be used to dynamically trigger retrieval in RAG systems, reducing unnecessary retrievals
while maintaining or even improving answer quality in long-form question answering tasks [18].</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Grammarly to assist with grammar and spelling
checks, paraphrasing, and rewording. The authors confirm that the intellectual content, analysis,
interpretations, and conclusions presented in this paper are entirely their own and take full responsibility
for the publication’s content.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[5] M. Alaofi, L. Gallagher, M. Sanderson, F. Scholer, P. Thomas, Can generative LLMs create query
variants for test collections? An exploratory study, in: Proceedings of the 46th International ACM
SIGIR Conference on Research and Development in Information Retrieval, 2023, pp. 1869–1873.</p>
      <p>[6] W. Peng, G. Li, Y. Jiang, Z. Wang, D. Ou, X. Zeng, D. Xu, T. Xu, E. Chen, Large language model
based long-tail query rewriting in Taobao search, in: Companion Proceedings of the ACM Web
Conference 2024, 2024, pp. 20–28.</p>
      <p>[7] X. Ma, Y. Gong, P. He, H. Zhao, N. Duan, Query rewriting in retrieval-augmented large language
models, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language
Processing, 2023, pp. 5303–5315.</p>
      <p>[8] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in:
K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019,
pp. 3982–3992. URL: https://aclanthology.org/D19-1410/. doi:10.18653/v1/D19-1410.</p>
      <p>[9] D. M. Bikel, K. B. Hall, ReFr: An open-source reranker framework, in: Interspeech 2013, 2013,
pp. 756–758.</p>
      <p>[10] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, in:
Advances in Neural Information Processing Systems, volume 33, 2020, pp. 9459–9474.</p>
      <p>[11] Google Cloud, Ranking and re-ranking search results,
https://cloud.google.com/generative-ai-app-builder/docs/ranking, 2024. Accessed: 2025.</p>
      <p>[12] M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula,
Q. Wang, L. Yang, A. Ahmed, Big Bird: Transformers for longer sequences, in: Advances in Neural
Information Processing Systems, volume 33, 2020.</p>
      <p>[13] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer (2020).</p>
      <p>[14] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage
retrieval for open-domain question answering, in: Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing, 2020, pp. 6769–6781.</p>
      <p>[15] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, H. Hajishirzi,
FActScore: Fine-grained atomic evaluation of factual precision in long form text generation, in:
EMNLP, 2023. URL: https://arxiv.org/abs/2305.14251.</p>
      <p>[16] J. Saad-Falcon, O. Khattab, C. Potts, M. Zaharia, ARES: An automated evaluation framework
for retrieval-augmented generation systems, in: Proceedings of the 2024 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, 2024, pp. 4392–4408.</p>
      <p>[17] C. Qu, L. Yang, M. Qiu, W. B. Croft, Y. Zhang, M. Iyyer, BERT with history answer embedding
for conversational question answering, in: Proceedings of the 43rd International ACM SIGIR
Conference, 2020, pp. 1133–1136.</p>
      <p>[18] W. Zhang, Y. Liu, H. Chen, X. Wang, To retrieve or not to retrieve? Uncertainty detection for
dynamic retrieval-augmented generation, arXiv preprint arXiv:2501.09292 (2025). URL:
https://arxiv.org/abs/2501.09292.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsatsaronis</surname>
          </string-name>
          , G. Balikas,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          , I. Partalas,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zschunke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Alvers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Polychronopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Almirantis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Baskiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gallinari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Artiéres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gaussier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrio-Alvers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          , I. Androutsopoulos,
          <string-name>
            <surname>G. Paliouras,</surname>
          </string-name>
          <article-title>An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition</article-title>
          ,
          <source>BMC Bioinformatics 16</source>
          (
          <year>2015</year>
          ).
          doi:10.1186/s12859-015-0564-6.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          , G. Paliouras,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. G.</given-names>
            <surname>Sanchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Farre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Davydova</surname>
          </string-name>
          , E. Tutubalina,
          <article-title>BioASQ at CLEF2024: The Twelfth Edition of the Large-Scale Biomedical Semantic Indexing and Question Answering Challenge</article-title>
          ,
          <year>2024</year>
          , pp.
          <fpage>490</fpage>
          -
          <lpage>497</lpage>
          . doi:10.1007/978-3-031-56069-9_67.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jagerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <article-title>Query expansion by prompting large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2305.03653</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Query2doc: Query expansion with large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2303.07678</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>