BIT.UA at BioASQ 12: From Retrieval to Answer Generation
Notebook for the BioASQ Lab at CLEF 2024

Tiago Almeida1,*, Richard A. A. Jonker1, João Reis1, João R. Almeida1 and Sérgio Matos1
1 IEETA/DETI, LASI, University of Aveiro, Aveiro, Portugal

Abstract
Biomedical information retrieval and question-answering are vital for accessing and processing the ever-increasing volume of biomedical data. Effective systems in this domain are essential for researchers, clinicians, and medical experts to make well-informed decisions. The BioASQ Task B Challenge fosters the development of advanced retrieval and question-answering systems by providing a platform for evaluating and comparing diverse approaches. This paper presents our participation in the twelfth edition of the BioASQ challenge, focusing on Task B. For Phase A, we employed a two-stage retrieval pipeline with the BM25 model from PISA and transformer-based neural reranking models, including PubMedBERT and BioLinkBERT. Additionally, we enhanced BM25 results with semantically similar documents using the BGE-M3 model and augmented the BioASQ training data. Outputs from these models were combined using reciprocal rank fusion (RRF). In Phases A+ and B, we utilized instruction-based transformer models such as Llama 3, Nous-Hermes2-Mixtral, and a BioASQ fine-tuned version of Gemma 2B for conditioned zero-shot answer generation. Our systems in Phase A achieved competitive results, consistently scoring at or near the top across all batches. In Phases A+ and B, our systems remained competitive, especially in terms of Recall.

Keywords
Information Retrieval, Dense Retrieval, Semantic Search, Large Language Model, Answer Generation, Pseudo Relevance Feedback

1. Introduction
Biomedical information retrieval and question answering are important and complex tasks, driven by the need to access and process vast amounts of biomedical data.
As the volume of published biomedical literature grows exponentially, effective retrieval and accurate question-answering systems are crucial for researchers, clinicians, and medical experts to make informed decisions. The BioASQ Task B Challenge aims to push the state of the art in retrieval and question-answering systems within the biomedical domain. By providing a platform for evaluating and comparing different approaches, BioASQ encourages the development of innovative algorithms and techniques that can handle the unique challenges posed by growing biomedical data. Participants are tasked with developing systems that can efficiently retrieve relevant documents and snippets, as well as generate accurate answers to biomedical questions.

In this paper, we outline the participation of the Biomedical Informatics and Technologies group of the University of Aveiro (BIT.UA) in the 12th BioASQ challenge [1]. Our team participated in Phase A, Phase B, and the newly added Phase A+. Phase A centered on information retrieval, primarily identifying the top documents responding to a biomedical query. Both Phases A+ and B can be characterized as Retrieval-Augmented Generation (RAG) tasks, where the objective is to answer a query using documents that provide relevant context.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
tiagomeloalmeida@ua.pt (T. Almeida); richard.jonker@ua.pt (R. A. A. Jonker); joaoreis16@ua.pt (J. Reis); joao.rafael.almeida@ua.pt (J. R. Almeida); aleixomatos@ua.pt (S. Matos)
0000-0002-4258-3350 (T. Almeida); 0000-0002-3806-6940 (R. A. A. Jonker); 0009-0002-3579-0711 (J. Reis); 0000-0003-0729-2264 (J. R. Almeida); 0000-0003-1941-3983 (S. Matos)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Figure 1: High-level overview of the entire system pipeline concerning the BioASQ tasks.

Figure 1 presents an overview of our solution as an end-to-end system for the BioASQ tasks. For Phase A, we followed the approach of Almeida et al. [2] with a two-stage retrieval pipeline. As a first stage, we used the traditional BM25 model from PISA [3, 4], followed by transformer-based neural reranking models, specifically PubMedBERT [5] and BioLinkBERT [6]. Additionally, we explored a semantic search approach with the BGE-M3 [7] model in two ways: enhancing BM25 results with semantically similar documents and augmenting the BioASQ training data with these documents. We used reciprocal rank fusion (RRF) to ensemble outputs from the various models. For Phases A+ and B, we employed instruction-based transformer models such as Llama 2 [8] and 3, Nous-Hermes2-Mixtral [9], and a BioASQ fine-tuned version of Gemma 2B [10] for conditioned zero-shot answer generation. Specifically, we utilized the top-5 most relevant articles to generate an ideal answer and explored using relevant snippets in Phase B. Furthermore, we investigated methods to select the optimal answer from a pool of candidates generated by these models.

2. Related Work
Biomedical information retrieval and question answering have made significant strides in recent years, partly due to advancements in Deep Learning and Large Language Models [11]. Evaluation platforms like BioASQ have provided a valuable testbed for benchmarking these advancements, continually pushing the state of the art. Looking at the information retrieval task, two-stage retrieval approaches continue to be the most adopted solutions [12, 13, 14, 15, 16, 17, 2, 18].
These methods combine the efficiency of sparse retrieval models, which reduce the candidate pool using weighted keyword matching, with the efficacy of neural reranking models, which refine the initial list by considering semantic understanding and contextual relevance. BM25 [19] is the most popular choice for sparse retrieval, while BERT-like models [20] are the preferred architecture for neural reranking. Although less popular, there have been some efforts to explore semantic search approaches, also known as dense retrieval [21, 13, 18], as replacements for sparse retrieval methods. These efforts were mainly motivated by the ability of dense retrieval models to capture semantic similarities between queries and documents, overcoming the limitations of term-based models like BM25, which rely on exact term matches. However, Ma et al. [21] demonstrated that dense retrieval alone was not capable of surpassing BM25 in terms of retrieval performance. Despite its ability to capture semantic similarities, the dense retrieval model struggled with exact lexical matching, which is a strength of BM25. Consequently, combining dense retrieval with BM25 in a hybrid model was found to be more effective, leveraging the strengths of both approaches to achieve superior results in the BioASQ challenge.

Regarding answer generation, with the advent of Large Language Models (LLMs), most solutions now follow a Retrieval-Augmented Generation (RAG) methodology. In this approach, relevant documents are first retrieved and then used to generate comprehensive and accurate answers. The main differences between the participants’ approaches lie in the models they use and how these models are configured. Some systems directly use ChatGPT [22, 23, 24], while others employ open-source models [2, 25], each leveraging distinct configurations to optimize performance.

3. Methodology
This section provides a detailed description of the corpora and datasets utilized throughout the challenge, including the preprocessing steps undertaken. It then outlines the specific methodologies adopted for each task in which we participated.

3.1. Corpus and Dataset
The BioASQ dataset [26] encompasses questions from the last eleven editions, totalling 5,049 questions. These are classified into four categories: factoid (1,551), yes/no (1,357), summary (1,210), and list (967). Each question is accompanied by a list of relevant documents, snippets (taken from the relevant documents), and an example of the ideal answer. For its corpus, the BioASQ challenge utilizes the PubMed/MEDLINE annual baseline. This year, the 2024 baseline is used, which includes over 36 million documents. As evidenced by Almeida et al. [2], the continual updates and removal of documents between each yearly baseline pose a challenge for maintaining consistency in document relevance across different editions of the BioASQ challenge. In other words, documents relevant to questions in earlier editions may not be present in the current edition’s document collection. To mitigate this issue and ensure accurate training data, each question in the dataset is marked with the year it was featured. Furthermore, we downloaded the PubMed/MEDLINE baselines from 2013 to 2024 to maintain a clear snapshot of the corpus as it existed when each question was initially posed. This approach enables us to align each question with the specific corpus version available at that time. In terms of document preprocessing, we observed that some documents in the baselines lacked titles, abstracts, or both. To address this, we simply removed these incomplete documents from the collection. Regarding the preprocessing of the training dataset, we primarily adopted two approaches. The first, inspired by Almeida et al.
[2], focused on maximizing the quality of the dataset, while the second concentrated on maximizing its quantity.

3.1.1. High-quality Dataset Preprocessing
Under the high-quality perspective, our goal is to ensure with confidence that every question in the dataset is valid, well-written, and correctly matched with relevant documents, even if this results in a reduction of training data. To achieve this, we thoroughly reviewed the dataset and found instances of repeated or very similar questions paired with different relevant documents. Consequently, we decided to merge similar questions along with their corresponding sets of relevant articles. For efficient and automatic merging, we utilized the pre-trained SimCSE [27] model to calculate the similarity between questions. Questions with a cosine similarity score above 0.99 were automatically merged, while those with a similarity score between 0.90 and 0.99 underwent manual review; in total, 43 questions were merged. Furthermore, considering the BioASQ guidelines, systems prior to the fourth edition of BioASQ could use full-text articles from PubMed Central (PMC) for judgment. This could lead to situations where later models, lacking access to full texts, might not have the necessary content to make accurate predictions, making the training data effectively incorrect. Therefore, we decided to remove all question pairs from before the fourth edition of BioASQ. Additionally, the capabilities of earlier systems were arguably less sophisticated than those available today, potentially leading to a less reliable gold standard compared to more recent editions. At the end of this process, the refined dataset comprised 3,795 questions (a reduction of 25%), totalling 28,910 positive question-document pairs.

3.1.2. High-quantity Dataset Preprocessing
On the other hand, in the high-quantity perspective, our aim is to maximize the number of annotated pairs, operating under the assumption that the sheer volume of data will outweigh any minor errors present in the dataset. Thus, in this approach, we chose to retain all 5,049 questions in the dataset. Nevertheless, we still recognized the need to address the previously mentioned issue concerning the use of full-text articles in the early editions of the BioASQ challenge. Specifically, for questions from these early editions, we examined the list of relevant snippets, and a document was considered positive if its relevant snippet was derived from the article’s abstract. At the conclusion of this process, the dataset included 5,049 questions, totalling 43,732 positive question-document pairs.

3.2. Phase A
In Phase A, we participated solely in the document retrieval subtask, and this section details the methods we employed. Inspired by the work of Almeida et al. [2], we developed a two-stage retrieval system. The first stage utilizes an efficient sparse retrieval method, followed by a neural reranking model as the second stage. To effectively integrate the knowledge from different models, we adopted the reciprocal rank fusion (RRF) [28] method to ensemble the outputs from the various models. Additionally, we explored methods to efficiently incorporate semantic search mechanisms into our pipeline, an aspect not extensively explored by previous teams in the challenge.

3.2.1. First stage: Sparse Retrieval
The objective of the first stage is to efficiently retrieve the best k candidate documents that potentially contain an answer to a given question, referred to throughout this document as the top-k documents. To achieve this, we utilized sparse retrievers, specifically the traditional BM25 [19] model from PISA [4, 3], a state-of-the-art text search engine written in C++ that supports advanced WAND-like search algorithms.
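As an illustration of the first stage, the BM25 scoring function can be sketched in pure Python. The production system relies on PISA's C++ implementation; the idf variant below is one common formulation and is an assumption, while k1 = 0.4 and b = 0.3 are the values selected by our tuning.

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl, k1=0.4, b=0.3):
    """Score one document against a query with BM25 (k1 and b from our grid search)."""
    score, dl = 0.0, len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)  # term frequency in the document
        if tf == 0:
            continue
        df = doc_freqs.get(term, 0)  # number of documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score
```

In practice, PISA builds an inverted index so that only documents containing at least one query term are ever scored, which is what makes retrieving from over 36 million documents feasible.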
During preliminary experiments, we observed that setting k to 1000 guarantees a recall of 90.2%, providing the best balance between efficiency and effectiveness. The parameters for BM25, specifically k1 and b, were selected through a preliminary hyperparameter tuning process: we conducted a grid search over k1 values ranging from 0.1 to 1.2 and b values ranging from 0.1 to 1.0, both with intervals of 0.1. The optimal values of k1 and b were found to be 0.4 and 0.3, respectively.

3.2.2. Second stage: Transformer-based Reranking
The second stage in our pipeline aims to thoroughly analyse and re-rank the top-1000 candidate documents retrieved in the first stage. To accomplish this, we employ a transformer-based cross-encoder architecture as our neural reranker model. This model encodes each question-document pair into a CLS representation, which a classifier then uses to compute the relevance score for each pair. We initialized our neural reranker models using the pretrained PubMedBERT [5] and BioLinkBERT [6] weights, and trained them with a pointwise (cross-entropy) loss and a pairwise (hinge) loss using the Trainer API from HuggingFace. While this section describes the methods adopted for the second stage, we will now briefly address other avenues that we explored but ultimately discarded due to their lack of performance. One problem with our neural reranker architecture is that its input size is limited to 512 tokens, which is too short for some question-document pairs, forcing us to truncate the document and potentially lose valuable information. To address this issue, we implemented a sentence-level neural reranking model that processed any question-document pair regardless of size by splitting the document into individual sentences that fit within the model’s size constraints.
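A minimal sketch of that sentence-level variant follows. The whitespace token count stands in for the real transformer tokenizer, `score_fn` stands in for the cross-encoder, and aggregating sentence scores by their maximum is an assumption, since the aggregation strategy is not detailed here.

```python
def sentence_level_score(question, sentences, score_fn, budget=512):
    """Sentence-level reranking sketch: score the question against each sentence.

    Token counts are approximated by whitespace splitting; each sentence is
    trimmed so that, together with the question, it fits the model's budget.
    """
    q_len = len(question.split())
    room = max(budget - q_len, 0)
    scores = []
    for sent in sentences:
        words = sent.split()[:room]
        scores.append(score_fn(question, " ".join(words)))
    # Max-aggregation over sentence scores is an assumed design choice.
    return max(scores) if scores else 0.0
```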
However, in all of our preliminary experiments, this sentence-level model did not surpass the performance of our other neural reranker models. Additionally, we explored a dynamic training regime in which the neural reranker model currently being trained was used to mine negative documents. However, this approach also failed to yield any improvement in our preliminary results.

3.2.3. Adding Semantic Search
Semantic search, also known as dense retrieval, has become increasingly popular for performing information retrieval, particularly due to its capability to address the vocabulary mismatch problem [29]. Despite its potential, semantic search approaches often face performance challenges and struggle to compete, especially in contexts like the BioASQ challenge, where the exact match signal significantly enhances sparse retrieval performance. Furthermore, searching through over 36 million documents poses a substantial computational challenge. Nonetheless, the value of semantic search is undeniable, especially in identifying documents that would otherwise be missed by sparse models. Inspired by MacAvaney et al. [30], our goal is to integrate semantic search as a complementary method to our sparse retrieval model, rather than as a hybrid method [31]. More precisely, we aim to identify documents semantically similar to the documents ranked higher by our neural re-ranking model, since, according to the Cluster Hypothesis [32], similar documents are likely to be relevant to the same question. After obtaining the semantically similar documents, they are ranked by our neural reranker model to determine their placement in the final ranking order. We refer to this technique as Dense Pseudo Relevance Feedback (DPRF); its integration with our two-stage retrieval pipeline can be seen in Figure 2. To efficiently gauge document similarity, we precomputed a similarity graph over the entire 2024 PubMed/MEDLINE database using the BGE-M3 model [7].
We only recorded connections between documents with a cosine similarity exceeding 0.85 to manage storage constraints, as recording all similarities would require petabytes of storage. Notably, having precomputed this graph, for any document in the entire collection we can instantaneously access the list of documents with a similarity higher than 0.85.

Figure 2: Overview of our two-stage retrieval pipeline using Dense Pseudo Relevance Feedback (DPRF).

Moreover, guided by the Cluster Hypothesis [32] and leveraging the new similarity graph, we propose augmenting our training data by increasing the number of positive documents. To achieve this, for each positive document in the dataset, we consider semantically similar documents with a similarity score above 0.95 as potentially relevant and include them in the training data.

3.3. Phase A+ and B
For Phases A+ and B, we participated only in the ideal answer generation subtask. The primary difference between these phases is in the resources provided by the organizers. In Phase B, participants are given a list of gold documents and snippets. In contrast, for Phase A+, participants must utilize the documents retrieved during Phase A. In terms of approach, similar to Almeida et al. [2], we focus primarily on zero-shot answer generation using large language models. Furthermore, given the vast array of available LLMs, we also explored a cost-effective method to determine which answer would be the best.

3.3.1. Answer Generation with LLMs
For answer generation, we primarily relied on the following large language models: Nous-Hermes2-Mixtral (referred to as Mixtral in the remainder of the paper), Llama2 70B, and Llama3 70B, the latter released during the competition. Unlike Almeida et al. [2], our approach involved exploring multiple sources of information as context for generating an ideal answer. Specifically, we propose using several abstracts, and in the context of Phase B, snippets, as the context for these LLMs.
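The context assembly just described can be sketched as follows; the bracketed format mirrors our prompt structure, and the helper names are illustrative rather than actual system code.

```python
def build_context(abstracts, max_docs=5):
    """Format up to five abstracts in the bracketed style used in our prompts."""
    return "\n".join(f"[Abstract: {a}]" for a in abstracts[:max_docs])

def build_prompt(question, abstracts):
    """Assemble a zero-shot prompt from a question and its top retrieved abstracts."""
    return f"{build_context(abstracts)}\n[Question: {question}]\nAnswer in less than 150 words:"

def truncate_words(answer, limit=200):
    """Hard fallback: cut a generated answer to the competition's 200-word limit."""
    return " ".join(answer.split()[:limit])
```

For Phase B, the same assembly can be applied with gold snippets in place of full abstracts.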
Additionally, we encountered an issue with the length of the generated answers: per the competition criteria, the maximum length allowed was 200 words. To comply with this, we included a word-limit constraint within the prompts and a truncation preprocessing step. Below, we present the two main prompt variations used:

Act as a biomedical expert. You will receive several abstracts (‘[abstract: Abstract]’) summarizing research findings and methodologies. Along with this, a question will be provided (‘[question]’). Your role is to analyze the abstract and provide a scientifically accurate, concise answer to the question, leveraging the information from the abstracts.
[Abstract: CONTEXT]
[Question: QUESTION]
Answer in less than 150 words:

Listing 1: Prompt variation 2; detailed structure emphasizing biomedical expertise.

Context: CONTEXT
Question: QUESTION
Answer in less than 150 words:

Listing 2: Prompt variation 1; basic structure for generating concise answers.

In these prompts, the contexts were provided as a list of up to five abstracts. We observed a benefit in utilizing multiple documents, with five proving to be an optimal balance. This reinforces our idea of using multiple sources of information as context. Exclusively for Phase B, we experimented with providing gold snippets instead of full abstracts, based on the rationale that feeding direct answers to the LLM would enable it to rewrite them in natural language. Additionally, we attempted to automatically extract exact answers with LLMs and use these as context for answer generation. However, this approach did not yield successful results, and we subsequently abandoned it. Besides the zero-shot approaches, we also explored fine-tuning a large language model to produce answers more aligned with the expectations of the BioASQ challenge. For this purpose, we utilized the unsloth library1 to fine-tune the Gemma 2B model2 [10].
Specifically, we used the 4-bit model and trained it using Low-Rank Adaptation (LoRA) [33]. For the training data, we employed the BioASQ dataset, where we structured prompts containing the question alongside multiple document abstracts, with the goal of generating the gold-standard ideal answer. The prompt used to fine-tune the model is as follows:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Act as a biomedical expert. You will receive several abstracts (‘[abstract: Abstract]’) summarizing research findings and methodologies. Along with this, a question will be provided (‘[question]’). Your role is to analyze the abstract and provide a scientifically accurate, concise answer to the question, leveraging the information from the abstracts.
### Input:
[Abstract: CONTEXT]
[Question: QUESTION]
Answer in less than 200 words:
### Response

Listing 3: Prompt used to fine-tune the Gemma model.

As previously mentioned, we needed to ensure that our answers contained fewer than 200 words. Despite adding an instruction and slightly tweaking the sampling generation hyperparameters, these adjustments could not reliably produce answers within the 200-word limit. Therefore, we propose a second generation step that we refer to as summarization, where the objective is to take the question and the original lengthy answer as inputs and make the answer more concise, aligning it with the BioASQ guidelines. To accomplish this, we utilized the Nous-Hermes2-Mixtral model with the following prompt:

Please provide a short and concise summary based on the provided answer:
Question: QUESTION
Answer: ANSWER

Listing 4: Prompt used to summarise model outputs using Mixtral.

Overall, both Phase A+ and Phase B followed the methods described above; however, there are some slight variations, which will be addressed in the submissions sections.

3.3.2. Answer Selection
Given the variety of prompts and models employed for answer generation, it is essential to establish a robust method for selecting the best possible answer. For this purpose, we propose using our neural re-ranking model from Phase A to select the ‘best’ answer. Intuitively, the model was trained to assess content relevance between a question and a document, making it suitably analogous for assessing the relevance between a question and an answer. In practical terms, for any given question and set of relevant documents, we would generate multiple answers using different combinations of prompts and LLMs. Then, we would feed each question-answer pair into our neural re-ranking model and select the answer with the highest score. This method also proves useful for detecting cases where the answers are hallucinatory or nonsensical. One potential issue with this approach is the model’s tendency to favour longer answers, which can be problematic given the strict word-limit constraints of the BioASQ challenge. This bias arises because the model was trained with full abstracts, which are typically more comprehensive and detailed than the concise answers required here.

1 https://github.com/unslothai/unsloth
2 https://huggingface.co/unsloth/gemma-2b-it-bnb-4bit

4. Results
In this section, we start by describing the evaluation metrics used throughout the paper. Then, we introduce some validation results that help us understand the performance of the proposed methods. Finally, we describe our submissions and the preliminary results.

4.1. Evaluation metrics
The BioASQ organizers employ well-known information retrieval metrics for evaluating Phase A systems and text generation metrics for Phases A+ and B. These metrics are computed by comparing the system’s predictions against a “gold standard” dataset. At the time of writing, the organizers can only provide a preliminary “gold standard” that was created during the question construction phase.
Subsequently, medical experts will carry out a manual judgment of all system predictions, constructing a more complete and final “gold standard”. Note that this process can take several months. Therefore, the results presented in this paper are considered preliminary, as they are subject to change upon the release of the final “gold standard”. It is also important to note that although the preliminary evaluation does not fully capture the true gold standard, it still offers valuable insights. According to Malakasiotis et al. [34], the primary metrics for Phase A include Mean Average Precision (MAP) and Precision. MAP assesses the quality of ranked retrieval results by evaluating the precision at various recall levels and averaging across all queries. It is calculated as follows:

\mathrm{MAP} = \frac{1}{n} \sum_{i=1}^{n} AP_i, \quad (1)

where AP_i is the average precision of the list returned for question q_i and is defined as:

AP = \frac{\sum_{r=1}^{|L|} P(r) \cdot \mathrm{rel}(r)}{\min(|L_R|, 10)}, \quad (2)

where |L| is the number of items, |L_R| is the number of relevant items (typically 10), P(r) is the precision when the returned list is treated as containing only its first r items, and rel(r) equals 1 if the r-th item in the list is relevant (i.e., in the “gold standard”) and 0 otherwise.

In Phases A+ and B, the evaluation focuses on Recall and F1 Score calculated from the ROUGE metric. ROUGE is a metric used to evaluate the quality of summaries by measuring the overlap of the generated summary with the “gold standard”. There are several versions of ROUGE. ROUGE-N, which uses n-grams to calculate the overlap between the artificial summary S and the reference summaries Refs, is calculated as follows:

\mathrm{ROUGE\text{-}N}(S \mid \mathrm{Refs}) = \frac{\sum_{R \in \mathrm{Refs}} \sum_{g_n \in R} C(g_n, S, R)}{\sum_{R \in \mathrm{Refs}} \sum_{g_n \in R} C(g_n, R)}, \quad (3)

where g_n is a word n-gram, C(g_n, S, R) is the number of occurrences of g_n in both S and reference R, and C(g_n, R) is the number of occurrences of g_n in R.
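Equations (1)–(3) map directly onto code; the sketch below is a minimal illustration (whitespace tokenization is a simplification of the official evaluation scripts).

```python
from collections import Counter

def average_precision(ranked, relevant, cap=10):
    """AP as in Eq. (2): sum of P(r)*rel(r) over ranks, normalised by min(|L_R|, cap)."""
    hits, total = 0, 0.0
    for r, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / r  # P(r) when the r-th item is relevant
    return total / min(len(relevant), cap)

def mean_average_precision(runs):
    """MAP as in Eq. (1): the mean of AP over all questions."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

def rouge_n(candidate, references, n=2):
    """ROUGE-N as in Eq. (3), with n-gram counts clipped against the candidate."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate.split())
    matched = total = 0
    for ref in references:
        ref_grams = ngrams(ref.split())
        matched += sum(min(count, cand.get(g, 0)) for g, count in ref_grams.items())  # C(g_n, S, R)
        total += sum(ref_grams.values())  # C(g_n, R)
    return matched / total if total else 0.0
```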
ROUGE-SU4 is a variation of ROUGE-N in which S indicates that skip bigrams are used instead of n-grams when computing the overlaps, U indicates that unigrams are also counted, and 4 means that the maximum distance between the words of any skip bigram is limited to 4. We utilize ROUGE-SU4 as the automatic metric in this work. The official metric for this task is human evaluation, which includes readability, recall, precision, and repetition.

4.2. Validation Results
In order to deepen our understanding of the proposed methodology and determine how to organize the submissions, we conducted several validation experiments. Specifically, we explored which preprocessing strategy for the dataset (high quality versus high quantity) contributed to better performance. Additionally, we evaluated the capability of the Dense Pseudo Relevance Feedback (DPRF) method to identify documents that were missed by the BM25 retrieval method. Regarding the training of our neural reranking model, we aimed to understand which training methodology, pointwise or pairwise, worked best, the impact of adding semantic data augmentation, and which model initialization checkpoint was most effective. Regarding the validation data, we chose to use the questions and their respective “gold standard” from the first two batches of 2023 (BioASQ 11). The primary reason for selecting these two batches, rather than employing a standard train/validation split, is rooted in the progressive nature of the BioASQ challenge. Over time, participant systems tend to improve, leading to increasingly effective solutions [35]. Consequently, since the “gold standard” is developed based on medical expert judgments of the submissions from these systems, and because better-performing systems provide more accurate and relevant outputs for experts to review, the quality of the “gold standard” is likely to be higher in newer batches.
Regarding the experiments, we began by examining the performance trade-off between high-quality and high-quantity preprocessing methods. To this end, we trained a PubMedBERT-Base model using each resulting dataset and discovered that the high-quality preprocessing yielded better results. Specifi- cally, the MAP from the high-quantity preprocessing was 45.36, while the high-quality preprocessing achieved a MAP of 46.78. In machine learning, data is often viewed on a spectrum, where achieving a balance between quality and quantity is necessary for optimal results. We believe that our high-quality preprocessing approach strikes the best balance here, showing the importance of data quality over quantity. Now discussing our Dense Pseudo Relevance Feedback (DPRF) method, we initially had concerns that the additional overhead might not yield significant benefits and that the approach might fail to retrieve any new documents beyond those identified by sparse retrieval. To test this, we ran the sparse retrieval (BM25) across the entire training corpus, and then applied DPRF to the documents retrieved by BM25 to see if it could retrieve any of the positive documents that BM25 failed to retrieve. In this experiment, DPRF successfully retrieved an additional 2,326 documents that were not identified by the sparse retrieval. However, there were still 4,508 documents that remained unretrievable3 . Although this approach does not recover every document, we believe it is promising and has the potential to enhance overall system performance. One of the last approaches we investigated was the use of either pointwise or pairwise training. We observed that pairwise training slightly enhanced performance on the validation data, achieving a MAP of 45.98 compared to 45.44 with pointwise training. Furthermore, when we added semantic data augmentation, performance significantly improved for both approaches: 49.53 for pairwise and 49.16 for pointwise. 
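The two training objectives compared above can be sketched as follows; the scores are the reranker's relevance logits, and the margin of 1.0 is an assumed value, as the hinge margin is not specified here.

```python
import math

def pointwise_loss(score, label):
    """Binary cross-entropy on a single question-document relevance logit."""
    p = 1.0 / (1.0 + math.exp(-score))  # sigmoid over the logit
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def pairwise_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss pushing a relevant document's score above a negative's by a margin."""
    return max(0.0, margin - (pos_score - neg_score))
```

Pointwise training treats each pair independently as binary classification, while pairwise training only constrains the relative ordering of a positive and a negative document for the same question, which matches the ranking objective more directly.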
Lastly, we present the general performance of the various model checkpoints tested. We only show the validation results for the pairwise models; however, we reached similar conclusions for the pointwise models. The MAP values for the models are as follows:

• BioLinkBERT-Large: 49.53
• BioLinkBERT-Base: 48.45
• PubMedBERT-Large: 48.26
• PubMedBERT-Base: 46.53

3 Note that this can also include documents that are no longer available, as DPRF only uses the 2024 baseline.

As expected, the large models outperformed their base counterparts, while BioLinkBERT seems to have the upper hand against the PubMedBERT model.

4.3. Phase A Results
A summary of the runs submitted for Phase A can be seen in Table 1, with a more detailed description of the runs presented in Appendix A. It is important to mention that some of the methods previously described were only implemented in between batches, which is why we frequently changed the configuration of the runs we submitted during Phase A. The results of the submissions can be seen in Table 2.

Table 1
Summary of the systems submitted for Phase A. Each system is denoted with a structure specifying data source (Q for high-quality, T for high-quantity), model type (BL for BioLinkBERT, PM for PubMedBERT), and model size (L for Large, B for Base). “2023” refers to models trained in 2023, and “All” refers to an RRF ensemble of all the previous runs. Furthermore, each system is an RRF ensemble, with the total number of runs indicated in parentheses. Note that in Batch 2, system-0 was trained with high-quantity data (T) instead of the usual high-quality data, and DPRF is applied where specified. Batch 4 includes submissions with pairwise models and models trained over validation data.
Generally, the final system in all batches uses an RRF ensemble of all runs, with Batch 3 switching to an ensemble of systems, and Batch 4 including 2023 model submissions.

System | Batch 1 | Batch 2 | Batch 3 | Batch 4
system-0 | Q: BL(L3) | T: BL(L2) + PM(B2) | Q: BL(L5+B2) + PM(L2+B6) | Q: BL(L5+B2) + PM(L2+B6)
system-1 | Q: BL(L3) + PM(L3) | System-0 + DPRF | System-0 + DPRF | System-0 + DPRF
system-2 | Q: BL(B3) + PM(B3) | Q: BL(L1+B1) + PM(L1+B2) | Q: PM(B5) | Q: BL(L3+B5) + PM(B3) (Pairwise)
system-3 | 2023 (16) | 2023 (16) | 2023 (16) | Q: BL(L3+B5) (Best on validation)
system-4 | All (28) | All (29) | All (4) | All+2023 (5)

Table 2: Performance metrics for the various systems across the different batches of phase A. Bold values represent our best submission.

System | Batch 1 MAP / Prec / Rank | Batch 2 MAP / Prec / Rank | Batch 3 MAP / Prec / Rank | Batch 4 MAP / Prec / Rank
system-0 | 18.00 / 11.51 / 5 | 22.17 / 10.12 / 4 | 24.87 / 9.80 / 4 | 37.52 / 12.47 / 6
system-1 | 20.18 / 11.56 / 3 | 21.10 / 9.43 / 5 | 24.93 / 9.56 / 3 | 37.73 / 12.39 / 5
system-2 | 20.06 / 12.94 / 4 | 20.41 / 10.85 / 9 | 24.27 / 9.82 / 5 | 36.90 / 10.47 / 8
system-3 | 20.24 / 10.09 / 2 | 19.47 / 9.09 / 11 | 21.46 / 8.89 / 10 | 37.40 / 10.82 / 7
system-4 | 20.67 / 10.39 / 1 | 20.96 / 8.72 / 6 | 24.15 / 9.10 / 6 | 39.03 / 10.47 / 3
Best Competitor | 16.12 / 7.06 / 6 | 22.93 / 9.53 / 1 | 25.49 / 8.59 / 1 | 39.30 / 10.00 / 1
Median | 10.74 / 7.18 / 20 | 11.51 / 7.18 / 25 | 12.50 / 6.40 / 29 | 17.69 / 8.24 / 24

Regarding the results shown in Table 2, for the first batch we trained only four models: PubMedBERT-Base, PubMedBERT-Large, BioLinkBERT-Base, and BioLinkBERT-Large. We utilized three checkpoints from these models for the submissions, combined with our models from last year (2023). These systems obtained the top five results for the first batch. From this batch, we observed that the ensemble of all runs performed best, and that the large and base models achieved similar performance (systems 1 and 2). However, relatively speaking, we believe the systems submitted for this batch were weaker compared to other batches, as we did not yet have many models with which to build a robust ensemble.
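Since every submitted system is an RRF ensemble of individual runs, it is worth recalling the fusion rule itself: each document receives a score of 1/(k + rank), summed over all runs it appears in. A minimal sketch follows, using the conventional constant k = 60 from Cormack et al.; the constant used in our actual runs may differ.

```python
def reciprocal_rank_fusion(runs, k=60):
    """Fuse several ranked lists with reciprocal rank fusion (RRF).

    `runs` is a list of ranked document-id lists (best first); each
    document scores 1 / (k + rank), summed across all runs.
    """
    scores = {}
    for run in runs:
        for rank, doc_id in enumerate(run, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only consumes ranks, it lets us combine runs whose raw scores live on incompatible scales, such as BM25 scores and neural reranker logits.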
Consequently, our 2023 models outperformed our newly trained models in this first batch.

For the second batch, we focused on identifying the impact of training with high-quality vs high-quantity data. Here, system-0, which used the high-quantity training data, performed best, contradicting our validation results. However, we must emphasise that these results are still preliminary, and the final rankings will likely change. Erroneously, the models of system-0 were trained for 10 epochs, while the models trained with high-quality data were trained for only 5 epochs, which may explain the observed discrepancy. Furthermore, within this batch, our newly trained models outperformed those from 2023.

In the third batch, we applied the lessons from the previous batches and trained several more models, varying the seeds and other model parameters. Notably, the larger number of models, 15 in the case of system-0, seems to have contributed to the highly competitive results. Surprisingly, in this batch the 2023 models were drastically outperformed by our newly trained models.

Compared to batch 3, batch 4 contained many similar runs, maintaining both systems 0 and 1 and adding a run with pairwise models, which were not able to outperform the pointwise models. Additionally, we included a submission containing our best-performing models on validation. Our best model from this batch was an ensemble of all our submissions combined with the 2023 models, which significantly outperformed our other models.

Finally, we discuss the performance of the Dense Pseudo Relevance Feedback (DPRF) method on the submissions. As shown in the results table, DPRF outperformed its counterpart in batches 3 and 4 but failed to achieve better results in batch 2. We believe this is because batch 2 was when we first introduced the DPRF method, and at that time we did not have a strong intuition on how to make it work properly.
This understanding was explored and improved for batches 3 and 4. Additionally, it is important to note that this preliminary evaluation may not favour the DPRF method: we believe that the preliminary “gold standard” constructed by the organizers may be biased towards documents with more exact matches between the question and the documents.

Overall, in all batches our systems outperformed the median on both metrics, obtaining the best MAP in batch 1, staying within less than one percentage point of the best submission in the remaining batches, and achieving the best precision in batches 1, 3, and 4.

4.4. Phase A+ Results

A summary of the runs submitted for phase A+ is presented in Table 3, with a more detailed description of the runs available in Appendix B. A major challenge here was deciding which source of positive documents from Phase A to use, as it was crucial to maintain consistency in the input source for a fair comparison of the LLMs’ performance when generating answers. The results of these submissions are shown in Table 4.

Table 3: Summary of the systems submitted for phase A+. The first value corresponds to the system number used as input source for the answer (e.g., 0 corresponds to system-0 from phase A; All corresponds to an ensemble using all sources), L refers to Llama (either 2 or 3), M refers to the Mixtral model, G refers to Gemma, and summ. refers to summarization using Mixtral. The number in parentheses indicates how many runs are present in the ensemble; if no value is specified, 1 is assumed.

System | Batch 1 | Batch 2 | Batch 3 | Batch 4
system-0 | 0 - L2 + M (4) | 4 - L2 + M (3) | 1 - L3 + M (4) | 1 - L3 + M (4)
system-1 | 1 - L2 + M (4) | All - M (8) | 2 - L3 + M (4) | 2 - L3 + M (4)
system-2 | 4 - L2 + M (3) | All - L2 + M (8) | 4 - L3 + M (4) | 4 - L3 + M (4)
system-3 | All - M (8) | All - Top 5 G (5) | 4 - L3 + summ. | 4 - L3 + summ.
system-4 | All - L2 + M (8) | All - Top 1 G (5) | 4 - L3 + summ. | 4 - L3 + summ.
Also, it is important to note that between batches 2 and 3, the Llama 3 model was released as the successor to Llama 2. This release prompted us to switch from Llama 2 to Llama 3 for the remaining batches.

For the first batch submission, our primary goal was to understand the impact of the input source documents on our downstream task of answer generation.

Table 4: Performance metrics for various systems across different batches. The metrics reported are the ROUGE-SU4 recall and F1, and the rank is ordered by recall. Bold values represent our best submission.

System | Batch 1 REC / F1 / Rank | Batch 2 REC / F1 / Rank | Batch 3 REC / F1 / Rank | Batch 4 REC / F1 / Rank
system-0 | 33.21 / 6.79 / 4 | 28.85 / 8.15 / 3 | 34.73 / 11.21 / 1 | 32.19 / 9.83 / 2
system-1 | 33.43 / 7.46 / 1 | 28.98 / 8.06 / 2 | 32.83 / 10.44 / 4 | 29.84 / 9.13 / 3
system-2 | 31.39 / 7.74 / 5 | 29.27 / 7.78 / 1 | 34.04 / 10.75 / 2 | 32.61 / 10.96 / 1
system-3 | 33.36 / 7.30 / 2 | 22.92 / 18.98 / 10 | 24.45 / 12.26 / 14 | 23.11 / 11.77 / 18
system-4 | 33.24 / 6.93 / 3 | 17.06 / 13.81 / 20 | 27.11 / 13.55 / 9 | 22.19 / 11.03 / 19
Best Competitor | 28.07 / 15.72 / 6 | 24.50 / 10.24 / 4 | 33.34 / 14.70 / 3 | 27.95 / 24.31 / 3
Median | 25.29 / 11.52 / 11 | 19.59 / 11.04 / 13 | 24.53 / 15.84 / 14 | 23.90 / 13.08 / 14

Upon reviewing the results, it appears that the source documents did not significantly impact performance, as almost all systems scored closely. The only exception was system-2, which achieved a slightly lower ROUGE-SU4 (Recall) score but compensated with the best ROUGE-SU4 (F1) score. This outcome gave us more flexibility in the upcoming batches to focus primarily on the performance of the LLMs.

For Batch 2, we primarily focused on testing the performance of the fine-tuned Gemma models. We discovered that these models benefited from having access to more source documents. However, their recall metrics were significantly worse than those of our other models, although their F1 scores were the best within the batch.
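Since all these systems are ranked by ROUGE-SU4, it may help to recall what the metric counts: matching unigrams plus skip-bigrams with a maximum skip distance of four between the paired words. The official evaluation uses the standard ROUGE toolkit; the sketch below is only a simplified single-reference approximation under one common reading of the unit definition.

```python
from collections import Counter

def skip_units(tokens, max_skip=4):
    """Unigrams plus skip-bigrams with at most `max_skip` positions
    between the paired words (simplified ROUGE-SU4 units)."""
    units = Counter((t,) for t in tokens)
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + max_skip]:
            units[(left, right)] += 1
    return units

def rouge_su4(candidate, reference):
    """Return (recall, precision, f1) over the simplified ROUGE-SU4 units."""
    cand = skip_units(candidate.lower().split())
    ref = skip_units(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped multiset intersection
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * recall * precision / (recall + precision)
    return recall, precision, f1
```

The recall term divides by the number of reference units, which is why longer generated answers tend to raise recall (more chances to match) while depressing precision and, with it, F1 — exactly the trade-off we observe between our long-form systems and the short Gemma answers.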
Unfortunately, during the competition we misinterpreted the Gemma results as underperforming, which led us not to use the fine-tuned approaches in the subsequent batches.

For Batches 3 and 4, we maintained the same submissions but shifted our focus to the summarization technique. According to the metrics, the systems employing summarization achieved higher ROUGE-SU4 (F1) scores at the expense of ROUGE-SU4 (Recall). We attribute this to the summarization producing shorter sentences, which naturally decreases recall while potentially increasing precision when the answers are correct.

Discussing the results within the competition across all batches, our models consistently achieved the best ROUGE-SU4 (Recall) scores. However, these models generally underperformed in terms of the ROUGE-SU4 (F1) metric, likely due to our bias towards generating longer answers. The only exception to this trend was the fine-tuned Gemma models, which produced higher F1 scores but significantly lower recall; this should be further investigated in future work.

4.5. Phase B Results

A summary of the runs submitted for phase B can be seen in Table 5, with a more detailed description of the runs presented in Appendix C. The results of the submissions can be seen in Table 6. In this phase, we conducted experiments similar to those in Phase A+, with the primary difference being the source of input: unlike in Phase A+, where inputs were derived from our retrieval models from Phase A, for Phase B we used the “gold standard” abstracts and snippets provided by the organizers.

Table 5: Summary of the systems submitted for phase B. The first value corresponds to the source for answer generation, either abstract (Abs.), snippet (Snipp.), or exact answer (EA). L refers to Llama (either 2 or 3), M refers to the Mixtral model, G refers to Gemma (using either the Top 1 abstract, the Top 5, or both), and summ. refers to summarization using Mixtral.
The number in parentheses indicates how many runs are present in the ensemble; if no value is specified, 1 is assumed.

System | Batch 1 | Batch 2 | Batch 3 | Batch 4
system-0 | Abs - L2 (2) | Abs - L2+M (4) | Snipp. - L3+M (4) | Snipp. - L3+M (4)
system-1 | Abs - L2+M (4) | Snipp. - L2+M (4) | summ(system-0) | summ(system-0)
system-2 | Snipp. - L2+M (4) | Abs - Top 1 G | Abs - L3+M (4) | Abs - L3+M (4)
system-3 | EA - L2+M (2) | Abs - Top 5&1 G (2) | summ(system-2) | summ(system-2)
system-4 | All (16) | All (10) | Snipp.+Abs - L3 | Snipp.+Abs - L3

Table 6: Performance metrics for our submissions for phase B. The metrics reported are the ROUGE-SU4 recall and F1, and the rank is ordered by recall. Bold values represent our best submission.

System | Batch 1 REC / F1 / Rank | Batch 2 REC / F1 / Rank | Batch 3 REC / F1 / Rank | Batch 4 REC / F1 / Rank
system-0 | 38.63 / 8.96 / 4 | 35.96 / 11.04 / 9 | 49.72 / 17.12 / 6 | 45.13 / 16.59 / 4
system-1 | 38.71 / 8.45 / 3 | 39.84 / 13.23 / 4 | 32.22 / 18.07 / 24 | 29.47 / 15.02 / 29
system-2 | 41.19 / 9.47 / 1 | 14.01 / 11.89 / 35 | 48.27 / 16.38 / 7 | 42.33 / 14.53 / 6
system-3 | 28.97 / 8.84 / 21 | 19.67 / 16.34 / 31 | 32.07 / 16.53 / 26 | 26.73 / 14.04 / 31
system-4 | 36.67 / 8.65 / 8 | 35.22 / 11.34 / 12 | 33.99 / 19.58 / 22 | 28.28 / 15.33 / 30
Best Competitor | 39.24 / 13.69 / 2 | 44.62 / 40.35 / 1 | 52.25 / 36.43 / 1 | 49.05 / 21.44 / 1
Median | 31.91 / 19.11 / 16 | 27.34 / 17.31 / 19 | 35.27 / 23.30 / 20 | 34.90 / 21.60 / 21

Similar to Phase A+, in the first batch we primarily focused on understanding the impact of the input source on the models, examining abstracts, snippets, and exact answers. The performance of the snippets was particularly strong compared to the full abstracts. Additionally, when comparing system-0 and system-1, we observed that incorporating the Mixtral models into the ensemble proved beneficial. Furthermore, as mentioned, we experimented with using exact answers as our source. This method involved using an LLM for zero-shot extraction of exact answers. However, this approach was not very successful, leading us to discard it in subsequent batches and not detail it further in this manuscript.
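The snippet-conditioned generation used in this phase can be illustrated with a minimal prompt builder. The template below is hypothetical (the exact wording of our prompts is not reproduced here); it only shows the general shape of a conditioned zero-shot prompt that concatenates the gold snippets as context before the question.

```python
def build_answer_prompt(question, snippets, max_snippets=10):
    """Assemble a conditioned zero-shot prompt from gold snippets.

    Hypothetical template for illustration; the prompts used in our
    actual runs differed in wording.
    """
    context = "\n".join(f"- {s.strip()}" for s in snippets[:max_snippets])
    return (
        "You are a biomedical expert. Using only the context below, "
        "answer the question in a concise, self-contained paragraph.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

Feeding snippets rather than full abstracts through such a template keeps the context short and focused, which is consistent with the stronger snippet results reported in Table 6.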
In the second batch, we primarily tested the performance of the fine-tuned Gemma model. Similar to Phase A+, this model exhibited lower ROUGE-SU4 (Recall) but higher ROUGE-SU4 (F1) scores compared to the Llama and Mixtral models. However, given that its ROUGE-SU4 (F1) score was still significantly lower than that of the best-performing model, we opted not to continue investigating this model in Phase B. Once again, the snippet submissions outperformed the abstract submissions, confirming their effectiveness.

In batches 3 and 4, we maintained the same submission strategy, with an increased emphasis on summarization techniques and more submissions utilizing snippets. We observed that these summarization techniques tend to improve ROUGE-SU4 (F1) scores at the expense of ROUGE-SU4 (Recall). Ideally, these techniques should undergo human evaluation before definitive conclusions are drawn. Additionally, we found that snippets slightly outperformed abstracts alone; moreover, the smaller context fed to the models reduced inference time. Our final submission in these batches did not use an ensemble of all models but instead employed the newer Llama 3 model with snippets, which resulted in surprisingly low performance. This outcome suggests that ensembling both the Llama and Mixtral models does indeed enhance performance.

In all batches, our best models were obtained using snippets, reaching the top recall in Batch 1. As in phase A+, the Gemma model obtained a higher ROUGE-SU4 (F1); however, this F1 was still below the competition median, which is why we focused on the models that performed well on recall. Comparing the results of phase A+ to phase B, we see a significant improvement, indicating that the gold-standard documents have a considerable impact on the generation.

5.
Error Analysis on Answer Generation

In this section, we aim to provide more insight into the real performance of the language models used in phase A+ and phase B. For that, we start by showing, in Figure 3, example answers produced by the strategies and models previously described in the methodology section.

Figure 3: Output from various models for the question: “Are FOLFIRINOX plus Bevacizumab the preferred therapy for left-sided RAS wild type advanced colorectal cancer with aggressive clinical presentation?”.

Llama 2 - prompt 1: “No, FOLFOXIRI-bevacizumab is not the preferred therapy for left-sided RAS wild-type advanced colorectal cancer with an aggressive clinical presentation. The study showed that FOLFOXIRI-bevacizumab achieved numerically superior survival outcomes compared to FOLFOX-panitumumab, but the difference was not statistically significant. Additionally, FOLFOXIRI-bevacizumab was associated with a higher incidence of chemotherapy-related adverse events, particularly grade 3 and 4 neutropenia. Therefore, doublet chemotherapy plus anti-EGFRs remains the preferred treatment for patients with left-sided RAS/BRAF wild-type mCRC, as it provides similar outcomes with less toxicity.”

Llama 2 - prompt 2: “No, FOLFIRINOX plus Bevacizumab is not the preferred therapy for left-sided RAS wild-type advanced colorectal cancer with an aggressive clinical presentation. The study showed that FOLFOXIRI-bevacizumab and FOLFOX-panitumumab have similar efficacy in terms of progression-free survival and overall survival, but FOLFOXIRI-bevacizumab had a numerically superior survival outcome. However, the study also showed that FOLFOXIRI-bevacizumab had more frequent chemotherapy-related adverse events, specifically grade 3 and 4 neutropenia. Therefore, the choice between the two regimens should be based on patient preference and potential impact on quality of life.”

Mixtral summarization of Llama 2 - prompt 1: “No, FOLFIRINOX plus Bevacizumab is not the recommended therapy for aggressive left-sided RAS wild type advanced colorectal cancer due to similar survival rates but higher incidence of adverse events compared to doublet chemotherapy and anti-EGFR treatment.”

Mixtral summarization of Llama 2 - prompt 2: “Short Summary: FOLFOXIRI-bevacizumab is not considered superior to doublet chemotherapy plus anti-EGFRs for left-sided RAS/BRAF wild-type metastatic colorectal cancer with aggressive clinical presentation due to numerically better survival outcomes but no statistical significance and higher toxicity. Therefore, the preferred treatment remains doublet therapy plus anti-EGFRs for such cases.”

Gemma - Top 5: “No, FOLFIRINOX plus Bevacizumab is not the preferred therapy for left-sided RAS wild type advanced colorectal cancer with aggressive clinical presentation.”

Gemma - Top 1: “Yes, FOLFIRINOX plus Bevacizumab is the preferred therapy for left-sided RAS wild type advanced colorectal cancer with aggressive clinical presentation.”

Mixtral - prompt 1: “The text does not provide a direct answer to this question, as it focuses on comparing FOLFOXIRI-bevacizumab and FOLFOX-panitumumab in left-sided RAS/BRAF wild-type mCRC. However, it does mention that initial therapy with FOLFOXIRI-bevacizumab is superior to doublets plus bevacizumab independently from primary tumor sidedness and RAS/BRAF status. FOLFIRINOX is a similar regimen to FOLFOXIRI, and both include all three drugs (5-fluorouracil, leucovorin, and irinotecan) but differ in the administration of oxaliplatin (given in FOLFOXIRI and not in FOLFIRINOX). Considering this information, it is plausible to consider FOLFIRINO” (truncated)

Mixtral - prompt 2: “Based on the abstract provided, the combination of FOLFOXIRI-bevacizumab (which is similar to FOLFIRINOX) and panitumumab were compared in patients with left-sided RAS/BRAF wild-type metastatic colorectal cancer (mCRC). The study found no significant difference in progression-free survival, overall survival, overall response rate, or disease control rate between the two treatment groups. However, there was a numerically superior survival outcome for FOLFOXIRI-bevacizumab, and it achieved a higher resection rate of metastases. Regarding toxicity, more chemotherapy-related adverse events were observed in the FOLFOXIRI-bevacizumab group, particularly grade 3 and 4 neutropenia. The abstract does not specifically mention aggressive clinical presentation as a criterion for treatment selection. Nevertheless, both treatments appear to” (truncated)

Looking first at an overview of all the models, we notice that Gemma provides the shortest outputs, offering very brief answers to questions while sometimes lacking important information. After Gemma comes the summarization technique, which also provides short answers while still maintaining some relevant information. Following this, both Mixtral and Llama provide outputs of similar length, with Mixtral occasionally producing shorter responses.

Looking more closely at each of the systems, starting with Gemma, we can see an important difference between using only one document and using five documents as sources: the answer to the question completely changed, indicating that more information is often needed to answer the questions. We also observed that the Gemma models often output insufficient information, typically lacking a detailed justification of the answer.

Next, examining the Llama outputs, we observe very detailed answers that clearly explain the responses to the questions. We also see that the prompt variation had limited impact, showing the robustness of the model.

Looking now at the summaries created by Mixtral, we can see that they may offer some of the most valuable outputs. However, the model may create additional artifacts, such as the “Short Summary” prefix, though the approach appears promising.
Finally, examining the performance of the Mixtral models, we can see that they lack consistency and are prone to several problems. In both cases, the generation did not actually complete: multiple paragraphs were produced, and a direct answer to the question was never explicitly stated. This indicates that Mixtral may not be the best solution here, in contrast to what the quantitative results suggested.

6. Conclusion

In this paper, we detail our team’s participation in the twelfth edition of the BioASQ challenge. Overall, our team’s performance was competitive, achieving top results in each of the tasks we participated in. Specifically, within phase A, we highlight our use of DPRF, which yielded promising results, as well as the addition of documents from semantic search to the training data. For the tasks in phases A+ and B, we presented a fine-tuned Gemma model and a summarization technique, which may pave the way for interesting future work. More specifically, within Phase B, we emphasize that the use of snippets significantly increases the performance of the models compared to using the full abstract.

Acknowledgments

This work was funded by the Foundation for Science and Technology (FCT) in the context of the project doi.org/10.54499/UIDB/00127/2020. Tiago Almeida is funded by the grant doi.org/10.54499/2020.05784.BD. Richard A. A. Jonker is funded by the grant PRT/BD/154792/2023. This work was also funded by FCT I.P. under the project Advanced Computing Project 2023.10766.CPCA.A0, platform Vision at University of Évora.

References

[1] A. Nentidis, G. Katsimpras, A. Krithara, S. Lima-López, E. Farré-Maduell, M. Krallinger, N. Loukachevitch, V. Davydova, E. Tutubalina, G. Paliouras, Overview of BioASQ 2024: The twelfth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. Maria Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N.
Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024. [2] T. Almeida, R. A. A. Jonker, R. Poudel, J. M. Silva, S. Matos, BIT.UA at BioASQ 11B: Two-Stage IR with Synthetic Training and Zero-Shot Answer Generation, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 37–59. URL: https://ceur-ws.org/Vol-3497/paper-004.pdf. [3] A. Mallia, M. Siedlaczek, J. Mackenzie, T. Suel, PISA: performant indexes and search for academia, in: Proceedings of the Open-Source IR Replicability Challenge co-located with 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, OSIRRC@SIGIR 2019, Paris, France, July 25, 2019., 2019, pp. 50–56. URL: http://ceur-ws.org/Vol-2409/docker08.pdf. [4] S. MacAvaney, C. Macdonald, A python interface to PISA!, SIGIR ’22, Association for Computing Machinery, New York, NY, USA, 2022, p. 3339–3344. URL: https://doi.org/10.1145/3477495.3531656. doi:10.1145/3477495.3531656. [5] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon, Domain- specific language model pretraining for biomedical natural language processing, ACM Trans. Comput. Healthcare 3 (2021). URL: https://doi.org/10.1145/3458754. doi:10.1145/3458754. [6] M. Yasunaga, J. Leskovec, P. Liang, LinkBERT: Pretraining language models with document links, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 8003–8016. URL: https://aclanthology.org/2022.acl-long.551. doi:10.18653/v1/2022.acl-long.551. [7] J. Chen, S. Xiao, P. 
Zhang, K. Luo, D. Lian, Z. Liu, Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2024. arXiv:2402.03216. [8] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288. [9] Teknium, theemozilla, karan4d, huemin_art, Nous hermes 2 mistral 7b dpo, 2024. URL: https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO. [10] G. DeepMind, Gemma: Open weights llm from google deepmind, https://github.com/google-deepmind/gemma, 2024. Accessed: 2024-05-29. [11] Q. Jin, Z. Yuan, G. Xiong, Q. Yu, H. Ying, C. Tan, M. Chen, S. Huang, X. Liu, S. Yu, Biomedical question answering: A survey of approaches and challenges, ACM Comput. Surv. 55 (2022). URL: https://doi.org/10.1145/3490238. doi:10.1145/3490238. [12] T. Almeida, S. Matos, BIT.UA at bioasq 8: Lightweight neural document ranking with zero-shot snippet retrieval, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020.
URL: https://ceur-ws.org/Vol-2696/paper_161.pdf. [13] J. Lu, J. Ma, K. B. Hall, Zero-shot hybrid retrieval and reranking models for biomedical literature, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th - to - 8th, 2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 281–290. URL: https://ceur-ws.org/Vol-3180/paper-19.pdf. [14] T. Almeida, S. Matos, Universal passage weighting mecanism (UPWM) in bioasq 9b, in: G. Faggioli, N. Ferro, A. Joly, M. Maistro, F. Piroi (Eds.), Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21st - to - 24th, 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 196–212. URL: https://ceur-ws.org/Vol-2936/paper-13.pdf. [15] D. Pappas, R. McDonald, G.-I. Brokos, I. Androutsopoulos, Aueb at bioasq 7: Document and snippet retrieval, in: P. Cellier, K. Driessens (Eds.), Machine Learning and Knowledge Discovery in Databases, Springer International Publishing, Cham, 2020, pp. 607–623. [16] T. Almeida, A. Pinho, R. Pereira, S. Matos, Deep learning solutions based on fixed contextualized embeddings from pubmedbert on bioasq 10b and traditional IR in synergy, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th - to - 8th, 2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 204–221. URL: https://ceur-ws.org/Vol-3180/paper-12.pdf. [17] M. Lesavourey, G. Hubert, Bioasq 11b: Integrating domain specific vocabulary to bert-based model for biomedical document ranking, in: M. Aliannejadi, G. Faggioli, N. Ferro, M.
Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 145–151. URL: https://ceur-ws.org/Vol-3497/paper-012.pdf. [18] A. Shin, Q. Jin, Z. Lu, Multi-stage literature retrieval system trained by pubmed search logs for biomedical question answering, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 178–189. URL: https://ceur-ws.org/Vol-3497/paper-016.pdf. [19] S. Robertson, H. Zaragoza, The probabilistic relevance framework: Bm25 and beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333–389. URL: http://dx.doi.org/10.1561/1500000019. doi:10.1561/1500000019. [20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423. [21] J. Ma, I. Korotkov, K. B. Hall, R. T. McDonald, Hybrid first-stage retrieval models for biomedical literature, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: https://ceur-ws.org/ Vol-2696/paper_92.pdf. [22] C. Hsueh, Y. Zhang, Y. Lu, J. Han, W. Meesawad, R. T. 
Tsai, NCU-IISR: prompt engineering on GPT-4 to stove biological problems in bioasq 11b phase B, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 114–121. URL: https://ceur-ws.org/Vol-3497/paper-009.pdf. [23] S. Ateia, U. Kruschwitz, Is chatgpt a biomedical expert?, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 73–90. URL: https://ceur-ws.org/Vol-3497/paper-006.pdf. [24] H. Kim, H. Hwang, C. Lee, M. Seo, W. Yoon, J. Kang, Exploring approaches to answer biomedical questions: From pre-processing to GPT-4, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 132–144. URL: https://ceur-ws.org/Vol-3497/paper-011.pdf. [25] D. Galat, M. Rizoiu, Enhancing biomedical text summarization and question-answering: On the utility of domain-specific pre-training, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 102–113. URL: https://ceur-ws.org/Vol-3497/paper-008.pdf. [26] A. Krithara, A. Nentidis, K. Bougiatiotis, G. Paliouras, BioASQ-QA: A manually curated corpus for Biomedical Question Answering, Scientific Data 10 (2023) 170. [27] T. Gao, X. Yao, D. Chen, SimCSE: Simple contrastive learning of sentence embeddings, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. 
Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 6894–6910. URL: https://aclanthology.org/2021.emnlp-main.552. doi:10.18653/v1/2021.emnlp-main.552. [28] G. V. Cormack, C. L. A. Clarke, S. Buettcher, Reciprocal rank fusion outperforms condorcet and individual rank learning methods, in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’09, Association for Computing Machinery, New York, NY, USA, 2009, p. 758–759. URL: https://doi.org/10.1145/1571941.1572114. doi:10.1145/1571941.1572114. [29] G. W. Furnas, T. K. Landauer, L. M. Gomez, S. T. Dumais, The vocabulary problem in human-system communication, Commun. ACM 30 (1987) 964–971. URL: https://doi.org/10.1145/32206.32212. doi:10.1145/32206.32212. [30] S. MacAvaney, N. Tonellotto, C. Macdonald, Adaptive re-ranking with a corpus graph, in: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, CIKM ’22, Association for Computing Machinery, New York, NY, USA, 2022, p. 1491–1500. URL: https://doi.org/10.1145/3511808.3557231. doi:10.1145/3511808.3557231. [31] Y. Luan, J. Eisenstein, K. Toutanova, M. Collins, Sparse, Dense, and Attentional Representations for Text Retrieval, Transactions of the Association for Computational Linguistics 9 (2021) 329–345. URL: https://doi.org/10.1162/tacl_a_00369. doi:10.1162/tacl_a_00369. [32] O. Kurland, The cluster hypothesis in information retrieval, in: M. de Rijke, T. Kenter, A. P. de Vries, C. Zhai, F. de Jong, K. Radinsky, K. Hofmann (Eds.), Advances in Information Retrieval, Springer International Publishing, Cham, 2014, pp. 823–826. [33] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W.
Chen, LoRA: Low-rank adaptation of large language models, 2021. arXiv:2106.09685.
[34] P. Malakasiotis, I. Pavlopoulos, I. Androutsopoulos, A. Nentidis, Evaluation measures for task b, in BioASQ-EvalMeasures-taskB (Version 1.1). Intelligent Information Management, Targeted Competition Framework, ICT-2011.4.4(d), Project FP7-318652 / BioASQ, 2020. Retrieved from http://www.bioasq.org.
[35] A. Nentidis, G. Katsimpras, A. Krithara, G. Paliouras, Overview of BioASQ tasks 11b and Synergy11 in CLEF2023, in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to 21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 19–26. URL: https://ceur-ws.org/Vol-3497/paper-003.pdf.

A. Phase A runs

A.1. Batch 1

All models in this batch were trained in a pointwise fashion.

• system-0: Ensemble of 3 BioLinkBERT-Large checkpoints.
• system-1: Ensemble of 3 BioLinkBERT-Large checkpoints and 3 PubMedBERT-Large checkpoints (6 total).
• system-2: Ensemble of 3 BioLinkBERT-Base checkpoints and 3 PubMedBERT-Base checkpoints (6 total).
• system-3: Our 2023 submission excluding some models (16 total).
• system-4: Ensemble of all models (28 total).

A.2. Batch 2

All newly trained models were trained for 10 epochs in a pointwise fashion; in this batch we were testing our new training data source.

• system-0: Ensemble of 2 BioLinkBERT-Large models and 2 PubMedBERT-Base models with varying seeds, trained with "new data" (4 total).
• system-1: DPRF applied to each system-0 submission.
• system-2: Ensemble of models trained on the previous data: PubMedBERT-Large, BioLinkBERT-Base, BioLinkBERT-Large, and two PubMedBERT-Base models, totaling 5 models.
• system-3: Our 2023 submission excluding some models (16 total).
• system-4: Ensemble of all models (29 total).

A.3. Batch 3

All models utilize old data from this batch onwards.
• system-0: Ensemble of 15 models: BioLinkBERT-Large x5, BioLinkBERT-Base x2, PubMedBERT-Large x2, PubMedBERT-Base x6.
• system-1: DPRF applied to each system-0 submission.
• system-2: Ensemble of 5 PubMedBERT-Base models.
• system-3: Our 2023 submission excluding some models (16 total).
• system-4: Ensemble of each submission file (the 4 files above).

A.4. Batch 4

• system-0: Same as batch 3 system-0.
• system-1: DPRF applied to each system-0 submission.
• system-2: Ensemble of 11 pairwise models: BioLinkBERT-Large x3, BioLinkBERT-Base x5, PubMedBERT-Base x3.
• system-3: Our top 8 models on validation: BioLinkBERT-Large x3, BioLinkBERT-Base x5; 7 of these models were trained in a pairwise manner.
• system-4: Ensemble of all the runs, combined with the 2023 runs (5 total).

B. Phase A+ runs

B.1. Batch 1

• system-0: This run utilized an ensemble of the two models and two prompts from the first submission (system-0) of Phase A.
• system-1: This run utilized an ensemble of the two models and two prompts from the second submission (system-1) of Phase A.
• system-2: This run utilized an ensemble of the two models and two prompts from the fifth submission (system-4) of Phase A.
• system-3: This run utilized an ensemble of all Mixtral models and two prompts.
• system-4: This run utilized an ensemble of all runs (18 total).

B.2. Batch 2

• system-0: This run utilized an ensemble of the two models and two prompts from the fifth submission (system-4) of Phase A.
• system-1: This run utilized an ensemble of all Mixtral models and two prompts (8 total).
• system-2: This run used an ensemble of all runs from either Mixtral or Llama models (20 total).
• system-3: This run was an ensemble of Gemma with the top 5 sources per question from each Phase A source.
• system-4: This run was an ensemble of Gemma with the best sources per question from each Phase A source.

B.3. Batch 3+4

• system-0: This run utilized an ensemble of the two models and two prompts from the second submission (system-1) of Phase A.
• system-1: This run utilized an ensemble of the two models and two prompts from the third submission (system-2) of Phase A.
• system-2: This run utilized an ensemble of the two models and two prompts from the fifth submission (system-4) of Phase A.
• system-3 and system-4: These runs utilized an ensemble of the two prompts from the fifth submission (system-4) of Phase A, using only Llama, with Mixtral summarization.

C. Phase B runs

C.1. Batch 1

• system-0: This run utilized an ensemble of 2 prompts using the Llama 2 model, with abstracts as sources.
• system-1: This run utilized an ensemble of 2 prompts using both the Llama 2 and Mixtral models, with abstracts as sources.
• system-2: This run utilized an ensemble of 2 prompts using both the Llama 2 and Mixtral models, with snippets as sources.
• system-3: This run used the exact answers to generate an answer.
• system-4: Ensemble of all submissions (16 total).

C.2. Batch 2

• system-0: This run utilized an ensemble of 2 prompts using both the Llama 2 and Mixtral models, with abstracts as sources.
• system-1: This run utilized an ensemble of 2 prompts using both the Llama 2 and Mixtral models, with snippets as sources.
• system-2: This run utilized the fine-tuned Gemma model, with a single abstract.
• system-3: This run utilized an ensemble of two Gemma submissions, one with a single abstract and one with multiple abstracts.
• system-4: Ensemble of all submissions (10 total).

C.3. Batch 3+4

• system-0: This run utilized an ensemble of 2 prompts using both the Llama 3 and Mixtral models, with snippets as sources.
• system-1: This run utilized an ensemble of 2 prompts using both the Llama 3 and Mixtral models, with snippets as sources, with a Mixtral summary applied.
• system-2: This run utilized an ensemble of 2 prompts using both the Llama 3 and Mixtral models, with abstracts as sources.
• system-3: This run utilized an ensemble of 2 prompts using both the Llama 3 and Mixtral models, with abstracts as sources, with a Mixtral summary applied.
• system-4: This run utilized an ensemble of 2 prompts using only the Llama 3 model, with abstracts and snippets as sources.
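Most of the runs above combine the ranked outputs of several models or prompts; for the retrieval runs this fusion uses reciprocal rank fusion (RRF) [28]. A minimal sketch of RRF is shown below; the function name and the conventional smoothing constant k=60 are illustrative choices, not taken from our codebase:

```python
from collections import defaultdict

def reciprocal_rank_fusion(runs, k=60):
    """Fuse several ranked lists of document IDs.

    Each run is a list of doc IDs ordered best-first; the fused score of
    a document is sum(1 / (k + rank)) over the runs that retrieved it.
    """
    scores = defaultdict(float)
    for run in runs:
        for rank, doc_id in enumerate(run, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Example: two hypothetical system outputs for one question
run_a = ["d1", "d2", "d3"]
run_b = ["d2", "d3", "d1"]
print(reciprocal_rank_fusion([run_a, run_b]))  # → ['d2', 'd1', 'd3']
```

Because RRF depends only on ranks, not raw scores, it combines lexical (BM25) and neural rankings without any score normalization, which is why it is a convenient choice for ensembles of heterogeneous models.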