<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NCU-IISR: A Retrieval-Augmented Generation Approach for BioASQ 13b Phase A and A+</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jen-Chieh Han</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bing-Chen Chih</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hsi-Chuan Hung</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Tzong-Han Tsai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for GIS, Research Center for Humanities and Social Sciences</institution>
          ,
          <addr-line>Academia Sinica, Taipei</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science and Information Engineering, National Central University</institution>
          ,
          <addr-line>Taoyuan</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Medical Research, Cathay General Hospital</institution>
          ,
          <addr-line>Taipei</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper describes the NCU-IISR systems submitted to BioASQ 13b Phase A and A+, which follow a retrieval-augmented generation (RAG) approach. We indexed the PubMed 2024 baseline corpus with BM25 and retrieved the top 100 candidate documents for each question. The initially retrieved candidate documents were further re-ranked using the BAAI/bge-reranker-v2-m3 model to identify the most relevant articles and, after sentence segmentation, the most relevant snippets. For answer generation, we employed both the meta-llama/Llama-3.1-8B-Instruct model and GPT-4o. Furthermore, for the Phase A+ task, we extended the answer generation pipeline previously developed by Chih et al. for Phase B, allowing for a comparative evaluation between two distinct generation strategies.</p>
      </abstract>
      <kwd-group>
        <kwd>Biomedical Question Answering</kwd>
        <kwd>Retrieval-Augmented Generation</kwd>
        <kwd>BM25</kwd>
        <kwd>BGE Reranker</kwd>
        <kwd>LLMs</kwd>
        <kwd>BioASQ</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large-scale, high-quality datasets for general-domain question answering (QA) are relatively easy to construct, whereas datasets in
specialized domains such as biomedicine are produced at a much slower pace, particularly in the case
of clinical records, which require anonymization and expert curation. Since 2012, the BioASQ
Challenge [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2-5</xref>
        ] has addressed this gap by providing annotated datasets for biomedical information
retrieval and QA. Each year’s test set builds upon previous ones, forming a cumulative benchmark
that drives progress in the field. Importantly, BioASQ incorporates both automated and manual
evaluation methods. In some cases, manual assessment of answers has led to ranking shifts among
participating systems, indicating that current automatic metrics do not fully capture human-level
comprehension or answer quality. As biomedical QA systems become more robust, they hold the
potential to assist medical professionals in making evidence-based decisions and uncovering latent
knowledge from the latest literature.
      </p>
      <p>
        To enable large language models (LLMs) to provide authoritative answers, it is essential to ground their outputs in
verifiable and citable sources. This is precisely the goal of retrieval-augmented generation (RAG) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
a technique that enhances the accuracy and reliability of generative AI by incorporating information
retrieved from relevant, domain-specific sources. RAG has seen widespread adoption, reflecting both
the current capabilities and future direction of generative AI systems. In this work, we explore its
potential in the biomedical domain by participating in the BioASQ 13b Challenge, specifically
focusing on Phase A and A+—tasks that involve document retrieval and answer generation. Our
system adopts a RAG-based architecture comprising three main components: a retriever, a reranker,
and an LLM for natural language generation (NLG). For the retriever, we employ the classical
information retrieval method BM25 to index the PubMed 2024 baseline dataset. From this index, we
retrieve the top 100 documents relevant to each question. These candidates are then re-ranked using
the BAAI/bge-reranker-v2-m3 model, which is optimized for QA tasks, to identify the most relevant
documents or snippets. For answer generation, we utilize two models: the open-source
meta-llama/Llama-3.1-8B-Instruct and the commercial GPT-4o. We experiment with two different types
of input to the generation stage: full documents and extracted snippets. The document-based input
allows for end-to-end generation directly from retrieved texts, while the snippet-based input is
integrated with Chih et al.'s [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] previously developed Phase B system, which was designed to answer
questions based on manually annotated snippets. This dual-input approach allows us to assess the
impact of input granularity on generation efficiency and answer quality.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        From the perspective of the core architecture adopted in this study, RAG is a technical framework
that integrates LLMs with external knowledge retrieval to enhance the accuracy of QA and content
generation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Notably, multiple teams employed RAG-based systems in last year’s BioASQ 12b
challenge [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7-12</xref>
        ], demonstrating the framework’s effectiveness in the biomedical QA domain. RAG
consists of two main components: a retriever and a generator. In simple terms, before generating an
answer, the system must first retrieve relevant information from external sources through a
three-step process: indexing, retrieval, and generation. During the indexing phase, external data are
processed—typically through tokenization, vectorization, or other techniques—and stored in a
searchable database. In the retrieval phase, the user’s question is compared against this database to
identify the most relevant documents. These documents, along with the original question, are then
fed into the LLM to generate a final answer. RAG effectively links external resources to generative
AI models, functioning like in-text citations during a conversation. This approach helps reduce
hallucinations—plausible-sounding but incorrect outputs—by grounding responses in real,
retrievable sources. Moreover, because RAG-based systems retrieve information from an external
and continually updatable knowledge base, their knowledge is not limited to a static training set.
This enables them to incorporate the latest information over time, ensuring that answer quality does
not degrade due to outdated knowledge. In the biomedical domain, this dynamic and reference-based
approach offers the potential for LLMs to act as reliable assistants to professionals, supporting
decision-making with up-to-date and verifiable evidence.
      </p>
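      <p>To make this three-step process concrete, the following toy Python sketch illustrates indexing, retrieval, and prompt construction. It uses the rank_bm25 package on a two-document in-memory corpus purely for illustration; it is not the toolkit or corpus used in this work, and the final LLM call is only indicated as a placeholder.</p>
      <preformat>
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# 1) Indexing: tokenize the external corpus and build a searchable BM25 index.
corpus = [
    "Metformin is a first-line drug for type 2 diabetes.",
    "BRCA1 mutations increase the risk of breast and ovarian cancer.",
]
index = BM25Okapi([doc.lower().split() for doc in corpus])

# 2) Retrieval: compare the user question against the indexed documents.
question = "Which drug is first-line for type 2 diabetes?"
evidence = index.get_top_n(question.lower().split(), corpus, n=1)[0]

# 3) Generation: feed the retrieved evidence and the question to an LLM.
prompt = f"Answer using only the context.\nContext: {evidence}\nQuestion: {question}"
# answer = llm(prompt)  # e.g. Llama-3.1-8B-Instruct or GPT-4o in the actual system
      </preformat>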
      <p>
        As the FlashRAG Toolkit [13] introduces a more modular pipeline that allows for the flexible
integration of components tailored to specific needs, we opted for a retriever-reranker setup, using
BM25 [14] for sparse retrieval and pairing it with a dedicated reranker module. BM25, a classic
term-frequency-based sparse retrieval method, remains a widely used [
        <xref ref-type="bibr" rid="ref9">9-12, 15-17</xref>
        ] and computationally
efficient approach, particularly suitable for large-scale corpora with limited computational resources.
While neural dense retrieval models such as BERT-based encoders are capable of capturing semantic
similarity, they often fall short in precise lexical matching, an area where BM25 excels. To further
improve retrieval precision, we incorporate a reranker to re-evaluate the top-ranked documents
retrieved by BM25. This component helps determine whether a document contains a snippet that
directly answers the user’s question, thereby improving answer relevance. Based on comparative
results from the previous year’s Batch 1 test set, we selected BAAI/bge-reranker-v2-m3 as our
reranker. This model—part of the M3-Embedding framework [18]—unifies several retrieval
paradigms, including dense retrieval, lexical (sparse) retrieval, and multi-vector retrieval. Notably, it
employs a novel self-knowledge distillation strategy, where relevance signals from multiple retrieval
modes are integrated as teacher supervision to improve training robustness. The reranker
demonstrates strong performance in both monolingual and cross-lingual retrieval tasks. Moreover,
its lightweight design and fast inference speed make it well-suited for practical deployment, and it
performed smoothly in our experiments.
      </p>
      <p>
        After the retrieval stage, we transitioned to the answer generation phase. Given computational
constraints, we primarily adopted the open-source meta-llama/Llama-3.1-8B-Instruct model
(https://ai.meta.com/blog/meta-llama-3-1/) to balance generation quality with low-latency inference. As an upgraded 8B-parameter model,
Llama-3.1-8B [19] supports multilingual capabilities, offers a significantly extended context window of up
to 128K tokens, and features enhanced tool usage and stronger reasoning abilities overall. Benchmark
results have shown that it outperforms GPT-3.5 Turbo in multiple tasks. While it does not yet surpass
the latest frontier models, it provides sufficient performance for constrained generation scenarios.
During the release period of the BioASQ test sets, we also incorporated GPT-4o (https://openai.com/index/hello-gpt-4o/) into selected
configurations beginning with Batch 2, enabling a comparison between an open-source and a
proprietary model. GPT-4o matches GPT-4 Turbo in English text and code performance, while also
offering improved speed and multimodal capabilities. As previous teams have already explored the
GPT family in BioASQ tasks [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7-11, 17, 20</xref>
        ], we were particularly interested in comparing how
biomedical QA performs under these two generation backbones.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>In this section, we provide a step-by-step overview of the corpus and task dataset, system architecture, components, LLM pipelines, and the configurations of the submitted systems.</p>
      <sec id="sec-3-1">
        <title>Corpus and Dataset</title>
        <p>For the BioASQ 13b challenge, the Phase A and A+ training datasets consist of 5,389 QA pairs. These
included 1,459 yes/no questions, 1,600 factoid questions, 1,047 list-type questions, and 1,283 summary
questions. Each question was accompanied by a list of relevant documents, relevant snippets
(extracted from those documents), an exact answer (except for summary questions), and an ideal
answer.</p>
        <p>All associated documents and snippets were derived from the PubMed baseline corpus. The version of the PubMed baseline used in this year's competition was released at the end of 2024, containing a total of 38,201,553 documents. We indexed this corpus with a BM25-based retriever to enable document retrieval. Out of the full corpus, 38,178,296 documents were successfully indexed, while 23,257 entries were empty and excluded from retrieval.</p>
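        <p>As an illustration of this preprocessing and indexing step, the following Python sketch converts the baseline into the two-field JSONL format expected by Pyserini and then builds a BM25 index. The pubmed-parser package, the file paths, and the thread count are illustrative assumptions, not necessarily the exact setup we used.</p>
        <preformat>
import glob
import json

import pubmed_parser  # pip install pubmed-parser (one possible baseline parser)

# Convert each PubMed baseline record into the two fields used for BM25 indexing:
# the article ID ("id") and its contents (title plus abstract). Empty records are
# skipped, mirroring the 23,257 excluded entries.
with open("pubmed2024.jsonl", "w", encoding="utf-8") as out:
    for xml_file in glob.glob("pubmed_baseline_2024/*.xml.gz"):
        for article in pubmed_parser.parse_medline_xml(xml_file):
            contents = f"{article['title']} {article['abstract']}".strip()
            if not contents:
                continue
            out.write(json.dumps({"id": article["pmid"], "contents": contents}) + "\n")

# The JSONL file can then be indexed with Pyserini's Lucene (BM25) indexer:
#   python -m pyserini.index.lucene --collection JsonCollection \
#     --input . --index indexes/pubmed2024-bm25 \
#     --generator DefaultLuceneDocumentGenerator --threads 8 --storeRaw
        </preformat>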
      </sec>
      <sec id="sec-3-2">
        <title>System Overview</title>
        <p>
          The overall RAG workflow of our system is illustrated in Figure 1. It consists of three main
components: a retriever, a reranker, and an LLM-based answer generator. The user question is fed into all three components, ensuring that each stage in the pipeline
has direct access to the original question for context-aware processing. Depending on the output of
the reranker, the system branches into two distinct pipelines based on the types of retrieved content:
either document or snippet. In Pipeline A, the top 10 documents are directly fed into the LLM for
end-to-end answer generation. In Pipeline B, the top 10 snippets are selected instead, and then
processed through the answer generation pipeline previously developed by Chih et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which
was originally designed for snippet-based QA tasks (BioASQ Task b Phase B).
        </p>
        <p>In the following sections, we provide a more detailed explanation of each component and the two pipeline variants.</p>
        <p>For the retriever component, we adopted BM25 as a reliable lexical retrieval baseline, using the 2024 PubMed baseline as our document source. We implemented our RAG system using the Python-based FlashRAG toolkit (https://github.com/RUC-NLPIR/FlashRAG). To streamline processing, each PubMed article was simplified into two fields: the article ID and its content (title and abstract). Our BM25-based retriever was implemented using the Pyserini [21] Python toolkit within the FlashRAG framework, with FlashRAG's default parameters. The retriever first identifies the top 100 documents most relevant to the input question, which are then passed to the reranker for further processing.</p>
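        <p>A minimal sketch of this retrieval step is shown below, calling Pyserini directly rather than through FlashRAG's wrappers; the index path and the example question are illustrative.</p>
        <preformat>
from pyserini.search.lucene import LuceneSearcher  # pip install pyserini (requires Java 11+)

# Path to the BM25 index built over the PubMed 2024 baseline (illustrative name).
searcher = LuceneSearcher("indexes/pubmed2024-bm25")

question = "Is dupilumab effective for the treatment of bullous pemphigoid?"
hits = searcher.search(question, k=100)  # Step 1: top 100 candidate documents

# Each hit carries the article ID; the stored raw JSON holds title plus abstract
# (available because the index was built with --storeRaw).
candidates = [(hit.docid, searcher.doc(hit.docid).raw()) for hit in hits]
        </preformat>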
      </sec>
      <sec id="sec-3-3">
        <title>Reranker</title>
        <p>To select our reranker component, we evaluated several open-source rerankers using the BioASQ
12b Phase A Batch 1 test set, as summarized in Table 1. The baseline retrieval was
performed using BM25 over the PubMed 2023 baseline, since the BioASQ 12b does not include
articles from PubMed 2024. We applied each reranker to the same top 100 documents retrieved by
BM25. Among the BGE series models, BAAI/bge-reranker-v2-m3 achieved the best performance and
was thus selected as our primary single reranker. To further explore the effectiveness of reranker
ensembles, we combined the top three rerankers from different sources—BAAI/bge-reranker-v2-m3
[18], mixedbread-ai/mxbai-rerank-large-v1 [22], and Alibaba-NLP/gte-reranker-modernbert-base
[23]. Their scores were first normalized to a 0–1 range before being aggregated and re-ranked.</p>
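        <p>The sketch below shows one way to apply BAAI/bge-reranker-v2-m3 through the FlagEmbedding package and to combine several rerankers after min-max normalization. Summing the normalized scores is an assumption on our part here, since the exact aggregation function is not detailed above.</p>
        <preformat>
from FlagEmbedding import FlagReranker  # pip install FlagEmbedding

# Cross-encoder reranker used in the single-reranker configuration (IR1).
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def rerank(question, docs, top_k=10):
    """Score (question, document) pairs and keep the top_k documents."""
    scores = reranker.compute_score([[question, doc] for doc in docs])
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order[:top_k]]

def min_max(scores):
    """Normalize one reranker's scores to the 0-1 range before ensembling."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def ensemble_scores(score_lists):
    """Aggregate normalized scores from several rerankers (simple sum, assumed)."""
    normalized = [min_max(scores) for scores in score_lists]
    return [sum(values) for values in zip(*normalized)]
        </preformat>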
        <p>From the top 100 retrieved documents, the reranker then selects either the top 10 documents or
the top 10 snippets (obtained through sentence segmentation of those documents) depending on the
specific requirements of the BioASQ 13b Phase A task. We submitted three different IR
configurations for this phase: one system using only the retriever (IR5), one incorporating the single
reranker (IR1), and another using the ensemble of three rerankers (IR4). A summary of the submitted
systems for Phase A is shown in Table 2. The selected top 10 documents or top 10 snippets were
then used as input to the LLM component, forming two separate pipeline branches, which will be
detailed in the following sections.</p>
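        <p>For the snippet branch, the following sketch segments the retrieved documents into sentences and reranks the sentences; NLTK's sent_tokenize is used here as a stand-in, since the sentence segmentation tool is not named above.</p>
        <preformat>
from FlagEmbedding import FlagReranker
from nltk.tokenize import sent_tokenize  # requires the NLTK "punkt" data package

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def top_snippets(question, docs, top_k=10):
    """Split retrieved documents into sentences and rerank the sentences as snippets.

    `docs` is a list of (doc_id, text) pairs, where text is the title plus abstract.
    """
    sentences = [(doc_id, sent) for doc_id, text in docs for sent in sent_tokenize(text)]
    scores = reranker.compute_score([[question, sent] for _, sent in sentences])
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in order[:top_k]]
        </preformat>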
        <sec id="sec-3-3-1">
          <title>LLM</title>
          <p>Due to limited computational resources, our system was primarily developed using the open-source meta-llama/Llama-3.1-8B-Instruct model, with GPT-4o integrated during the testing phase to enhance answer generation performance. All experiments were run on a machine with one NVIDIA RTX 3090 GPU and one GTX 1080 GPU. Depending on the type of input selected during the IR stage, we employed different generation pipelines: Pipeline A used the top 10 retrieved documents as input, while Pipeline B used the top 10 extracted snippets.</p>
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>Pipeline A</title>
        <p>To ensure consistency in our experiment, we configured the LLM with a temperature of 0, aiming to
produce deterministic outputs. Additionally, we ensured that the output length was sufficient to
avoid incomplete responses.</p>
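        <p>For illustration, deterministic generation with Llama-3.1-8B-Instruct can be configured as follows using the Hugging Face transformers pipeline; the serving stack and the token budget shown here are assumptions, not our exact configuration.</p>
        <preformat>
from transformers import pipeline  # pip install transformers accelerate

# Load the instruction-tuned model (gated on Hugging Face; access must be granted).
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
)

prompt = "..."  # the Table 3 prompt with one retrieved document filled in

output = generator(
    prompt,
    max_new_tokens=512,      # generous budget so answers are not truncated
    do_sample=False,         # greedy decoding: deterministic, the analogue of temperature 0
    return_full_text=False,  # return only the newly generated JSON answer
)
print(output[0]["generated_text"])
        </preformat>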
        <p>When using documents as input, we designed prompts for generating both exact and ideal
answers for Phase A+, as illustrated in Table 3. To avoid potential information loss due to long
prompt length when concatenating the top 10 documents, each document was fed into the LLM
separately. As a result, each user question has 10 exact answers and 10 ideal answers.</p>
        <p>For exact answers, we applied a simple aggregation strategy: we selected the most frequently
generated answer among the 10 outputs. In the case of a tie, the answer from the earliest ranked
document (preserving the original document order) was chosen as the final output.</p>
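        <p>A minimal sketch of this aggregation strategy is given below; whether empty-string answers are excluded from the vote is an assumption.</p>
        <preformat>
from collections import Counter

def aggregate_exact_answers(answers):
    """Majority vote over the 10 per-document exact answers.

    `answers` is ordered by document rank. Empty strings (documents without
    enough information) are ignored here, which is an assumption; ties are
    broken in favour of the answer from the earliest-ranked document.
    """
    candidates = [a.strip() for a in answers if a.strip()]
    if not candidates:
        return ""
    counts = Counter(candidates)
    top = max(counts.values())
    for answer in candidates:  # first hit corresponds to the earliest document rank
        if counts[answer] == top:
            return answer

print(aggregate_exact_answers(["yes", "", "no", "yes", ""]))  # -> "yes"
        </preformat>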
        <p>For ideal answers, simple majority voting was not applicable due to possible variations in
phrasing. Instead, we concatenated the 10 generated ideal answers and used the LLM again to select
the response that best addressed the question.</p>
        <p>The prompt used in Pipeline A (Table 3) is shown below:</p>
        <preformat>
Generate a **JSON** response with the following structure. Ensure that both "exact_answer" and "ideal_answer" fields are always included in the output:

1. "exact_answer": Provide a response based on the question type. If the provided document does not contain enough information to answer, return an empty string (""):
   - **Yes/No Questions**: Answer with either "yes" or "no".
   - **Factoid Questions**: Provide a specific entity name (e.g., disease, drug, gene), a number, or a similar short expression.
   - **List Questions &amp; Multiple Choice Questions**: Provide a list of entity names (e.g., gene names), numbers, or similar short expressions. If the model generates a comma-separated string instead of a list, convert it into a list format.
   - **Summary Questions**: Return an empty string ("") since these questions do not require an exact answer.

2. "ideal_answer": Generate a concise summary of the most relevant information. Follow these constraints:
   - **Word Limit**: The response **must not exceed 200 words** under any circumstances.
   - **Content Limitation**: Only extract and summarize information from the document; do **not** add any personal reasoning, assumptions, or explanations.
   - **No Guessing**: If the document does **not** provide enough relevant information, **return an empty string ("") instead of attempting to answer**.

Strictly base the answer on the provided document:
[Document]
[Question Type] question: [Question]

Answer:
        </preformat>
      </sec>
      <sec id="sec-3-5">
        <title>Pipeline B</title>
        <p>
          In this pipeline, the LLM input consists of the top 10 snippets, following a process similar to that
used in BioASQ Task b Phase B. However, while Phase B provides expert-annotated snippets, Phase
A+ relies on pseudo-snippets derived from retrieved documents. To explore this setting further, we
extended our system in the later stage of the competition by incorporating the framework developed
by Chih et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which also leverages RAG techniques. This extended system was paired with
GPT-4o and used for comparison with our primary design in Pipeline A.
        </p>
        <p>Table 4 summarizes the configurations of our submitted systems for BioASQ 13b Phase A+, categorized by pipeline type and the LLM used.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>IR Modul</title>
        <p>BM25
BM25 + Reranker</p>
        <p>Reranker
(from Sentences of BM25</p>
        <p>Top 100 Docs)</p>
        <p>3 Reranker
(from Sentences of BM25</p>
        <p>Top 100 Docs)</p>
        <p>NLG Modul
Llama-3.1-8B-Instruct</p>
        <p>IR5
IR1
Since our system was developed concurrently with the competition timeline, we faced server issues
during the early stages. As a result, complete system submissions began from Batch 3. Starting with
Batch 3, we submitted three systems for BioASQ 13b Phase A (IR1, IR4, IR5) and five systems for
Phase A+ (IR1-IR5).</p>
        <p>This section shows our preliminary results of BioASQ Task 13b. The final and official results will
be released in September, following the manual evaluation of all system responses by BioASQ experts
and the enrichment of the ground truth with potentially additional correct answers. As such,
rankings and final scores are not reported in this paper and the current results are for reference only.
4.1.</p>
        <sec id="sec-4-1-1">
          <title>Phase A Results</title>
          <p>The results for BioASQ 13b Phase A are presented in Table 5. Overall, our system performed
above the median across both document and snippet retrieval tasks. Even for IR5, our BM25-only
baseline, the scores were generally around the median, which suggests that BM25 remains a widely
adopted and dependable approach for document retrieval among participating teams. Further
improvements were observed with IR1 and IR4, both of which incorporated a reranker after initial
retrieval. This shows the reranker's benefit in refining the relevance ranking between the user
question and candidate documents. However, IR4, which employed an ensemble of three different
reranker variants, did not outperform IR1, indicating that such ensembles may introduce noise
rather than improve ranking quality. A single, strong reranker (IR1) achieved better results. As for
snippet retrieval, while our systems still performed above the median, the gap between our best runs
and the top-performing systems was more pronounced. This suggests that our current strategy, which
ranks individual sentences from the retrieved documents solely with a reranker, remains insufficient.
More sophisticated techniques may be necessary to improve snippet-level retrieval performance
further.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Phase A+ Results</title>
          <p>The results for BioASQ 13b Phase A+ are shown in Table 6. Our system consistently achieved
above-median performance on ideal answers, indicating that LLMs are effective at generating long answers,
in contrast to the more volatile results for exact answers. Among systems using the
Llama-3.1-8B-Instruct model, the IR5 baseline, which relies solely on BM25, performed worse across all metrics
compared to IR1, which incorporated a reranker after BM25 retrieval. This highlights the importance
of reranking in improving overall performance. Comparing IR1 and IR2, both of which used the top
10 documents (Pipeline A), we observed further improvement in ideal answers when using GPT-4o.
However, for exact answers, the performance of GPT-4o and Llama-3.1-8B-Instruct was close.
IR3, which also used GPT-4o but with the top 10 snippets (Pipeline B), showed additional gains in
ideal answers. This suggests that integrating the snippet-based approach from Chih et al. remains
beneficial, although the impact varies by question type. In contrast, IR4 (despite also using GPT-4o)
utilized an ensemble of three different reranker models to select the top 10 snippets. The results
showed no consistent advantage over the single reranker setup in IR3, implying that reranker
ensembles may introduce noise rather than improve reliability.</p>
        </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Based on the preliminary results, the BM25-based retriever remains a reliable approach for information retrieval in Phase A. When combined with a reranker, performance improves further, especially in document retrieval, though there is still room for enhancement. However, for snippet retrieval, the current setup remains underdeveloped and requires significant improvement.</p>
      <p>In the NLG stage (Phase A+), both meta-llama/Llama-3.1-8B-Instruct and GPT-4o demonstrate
strong performance in generating ideal answers. GPT-4o tends to outperform Llama-3.1-8B-Instruct
on average when using the same document inputs (Pipeline A). Moreover, when GPT-4o is provided
with snippet-based inputs (Pipeline B) and a more structured generation pipeline, scores improve
even further. That said, for exact answers, the performance between the two LLMs varies by question
type and appears highly dependent on the annotations in each batch.</p>
      <p>This competition provided valuable insights into the inherent challenges of the BioASQ task. From a data perspective, the annual one-time update of the PubMed baseline (unlike the continuously updated files that include new, revised, and deleted citations) poses a significant challenge. Earlier BioASQ questions may have been annotated based on documents that have since been modified or removed, making it more difficult for models using the most recent PubMed baseline to answer older questions accurately. Combined with the diverse nature of the questions and the subjective variability introduced by different annotators, maintaining stable model performance in BioASQ is particularly difficult. We experienced this firsthand when attempts to fine-tune a reranker using BioASQ's training data failed to converge during the competition.</p>
      <p>Acknowledgements. This research was supported by the Ministry of Science and Technology, Taiwan (112-2221-E-008-062-MY3).</p>
      <p>Declaration on Generative AI. During the preparation of this work, the authors used ChatGPT for text translation. After using this service, the authors reviewed and edited the content as needed and take full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          et al.,
          <source>"Gpt-4 technical report," arXiv preprint arXiv:2303.08774</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsatsaronis</surname>
          </string-name>
          et al.,
          <article-title>"An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition,"</article-title>
          <source>BMC bioinformatics</source>
          , vol.
          <volume>16</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>28</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bougiatiotis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Paliouras</surname>
          </string-name>
          ,
          <article-title>"BioASQ-QA: A manually curated corpus for Biomedical Question Answering,"</article-title>
          <source>Scientific Data</source>
          , vol.
          <volume>10</volume>
          , no.
          <issue>1</issue>
          , p.
          <fpage>170</fpage>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Mork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          , and
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Paliouras, "The road from manual to automatic semantic indexing of biomedical literature: a 10 years journey,"</article-title>
          <source>Frontiers in Research Metrics and Analytics</source>
          , vol.
          <volume>8</volume>
          , p.
          <fpage>1250930</fpage>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          et al.,
          <source>"Overview of BioASQ</source>
          <year>2025</year>
          :
          <article-title>The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering," in Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), J.
          <string-name>
            <surname>Carrillo-de-Albornoz</surname>
          </string-name>
          et al., Eds.,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          et al.,
          <article-title>"Retrieval-augmented generation for knowledge-intensive nlp tasks,"</article-title>
          <source>Advances in neural information processing systems</source>
          , vol.
          <volume>33</volume>
          , pp.
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.-C.</given-names>
            <surname>Chih</surname>
          </string-name>
          , J.-C. Han, and
          <string-name>
            <given-names>R.</given-names>
            <surname>Tzong-Han</surname>
          </string-name>
          <string-name>
            <surname>Tsai</surname>
          </string-name>
          ,
          <article-title>"NCU-IISR: enhancing biomedical question answering with GPT-4 and retrieval augmented generation in BioASQ 12b phase B,"</article-title>
          <source>CLEF Working Notes</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ateia</surname>
          </string-name>
          and
          <string-name>
            <given-names>U.</given-names>
            <surname>Kruschwitz</surname>
          </string-name>
          ,
          <article-title>"Can open-source LLMs compete with commercial models? Exploring the few-shot performance of current GPT models in biomedical tasks,"</article-title>
          <source>arXiv preprint arXiv:2407.13511</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>"Enhancing biomedical question answering with parameter-efficient fine-tuning and hierarchical retrieval augmented generation,"</article-title>
          <source>CLEF Working Notes</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] B.-W. Huang, "Generative large language models augmented hybrid retrieval system for biomedical question answering," CLEF Working Notes, 2024.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. H. Merker, A. Bondarenko, M. Hagen, and A. Viehweger, "MiBi at BioASQ 2024: Retrieval-augmented generation for answering biomedical questions," in Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 2024, vol. 3740, pp. 176-187.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] D. Panou, A. Dimopoulos, and M. Reczko, "Farming open LLMs for biomedical question answering," CLEF Working Notes, 2024.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. Jin et al., "FlashRAG: A modular toolkit for efficient retrieval-augmented generation research," arXiv preprint arXiv:2405.13576, 2024.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] S. Robertson and H. Zaragoza, "The probabilistic relevance framework: BM25 and beyond," Foundations and Trends® in Information Retrieval, vol. 3, no. 4, pp. 333-389, 2009.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] T. Almeida, R. A. Jonker, J. Reis, J. R. Almeida, and S. Matos, "BIT.UA at BioASQ 12: From retrieval to answer generation," CLEF Working Notes, 2024.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] M. Lesavourey and G. Hubert, "Enhancing biomedical document ranking with domain knowledge incorporation in a multi-stage retrieval approach," in 12th BioASQ Workshop at CLEF 2024, 2024, vol. 3740.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] O. Şerbetçi, X. D. Wang, and U. Leser, "HU-WBI at BioASQ12B Phase A: Exploring rank fusion of dense retrievers and re-rankers," in Proceedings of the Conference and Labs of the Evaluation Forum, Grenoble, France, 2024, pp. 9-12.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, and Z. Liu, "M3-Embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation," in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 2318-2335.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] A. Grattafiori et al., "The Llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] W. Zhou and T. H. Ngo, "Using pretrained large language model with prompt engineering to answer biomedical questions," arXiv preprint arXiv:2407.06779, 2024.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, and R. Nogueira, "Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations," in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 2356-2362.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] A. Shakir, D. Koenig, J. Lipp, and S. Lee, "Boost your search with the crispy mixedbread rerank models," 2024.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] X. Zhang et al., "mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval," arXiv preprint arXiv:2407.19669, 2024.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] S. Xiao, Z. Liu, P. Zhang, N. Muennighoff, D. Lian, and J.-Y. Nie, "C-Pack: Packed resources for general Chinese embeddings," in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 641-649.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>