<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Harnessing Collective Intelligence of LLMs for Robust Biomedical QA: A Multi-Model Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dimitra Panou</string-name>
          <email>panou@fleming.gr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandros C. Dimopoulos</string-name>
          <email>dimopoulos@fleming.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manolis Koubarakis</string-name>
          <email>koubarak@di.uoa.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Reczko</string-name>
          <email>reczko@fleming.gr</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Archimedes, Athena Research Center</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Informatics &amp; Telematics, School of Digital Technology, Harokopio University</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Informatics and Telecommunications, National and Kapodistrian University of Athens</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Institute for Fundamental Biomedical Science, Biomedical Sciences Research Center "Alexander Fleming"</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Biomedical text mining and question-answering are essential yet highly demanding tasks, particularly in the face of the exponential growth of biomedical literature. In this work, we present our participation in the 13th edition of the BioAsq challenge, which involves biomedical semantic question-answering for Task 13b and biomedical question-answering for developing topics in the Synergy task. We deploy a selection of open-source large language models (LLMs) as retrieval-augmented generators to answer biomedical questions. Various models are used to process the questions. A majority voting system combines their outputs to determine the final answer for Yes/No questions, while for list and factoid questions, the union of their answers is used. We evaluated 13 state-of-the-art open-source LLMs, exploring all possible combinations of models contributing to the final answer, resulting in tailored LLM pipelines for each question type. Our findings provide valuable insights into which combinations of LLMs consistently produce superior results for specific question types. In the four rounds of the 2025 BioAsq challenge, our system achieved notable results: in the Synergy task, we secured 1st place for ideal answers and 2nd place for exact answers in round 2, as well as two shared 1st places for exact answers in rounds 3 and 4.</p>
      </abstract>
      <kwd-group>
        <kwd>Biomedical Question Answering</kwd>
        <kwd>BioAsq</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Retrieval-augmented generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. The BioAsq Challenge</title>
        <p>The BioAsq challenge has played a central role in advancing biomedical question answering (QA),
particularly through its tasks. Task B requires systems to retrieve relevant documents and snippets and
then generate precise answers to biomedical questions, while the Synergy track adds further complexity
by introducing an interactive, feedback-based QA setting, simulating real-world clinical scenarios.
These tasks push the limits of current Retrieval-Augmented Generation (RAG) systems, demanding
high precision in both information retrieval and generation.</p>
        <p>
          Although large language models (LLMs) are becoming increasingly efficient, the BioAsq
challenge demonstrates that achieving accurate results still relies on well-structured and carefully designed
QA pipelines. Effective systems use hybrid retrievers [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], domain-specific encoders [
          <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
          ], and fine-tuned
generators tailored for biomedical text. Pipelines often include re-ranking steps, prompt tuning [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
and targeted post-processing to handle subtasks such as yes/no classification or list generation. These
components are critical to ensure relevance, factuality, and clarity, qualities that end-to-end LLMs still
struggle with in complex domains.
        </p>
        <p>
          Our lab has participated in the BioAsq Challenge for three consecutive years. During this period, we
experimented with various methodologies to enhance the document selection task. We began by
developing our own model, ELECTROLBERT [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], and later fine-tuned a GAN [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] combined with sparse
BM25 for document ranking. In our most recent iteration, we transitioned to leveraging existing models
for document retrieval, systematically exploring and comparing sparse, dense, and hybrid approaches
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>Despite improvements in our pipelines, we observed that the Mean Average Precision (MAP) in Phase
A remains relatively low for all participants. This is mainly because selecting the right documents has
become more difficult as the document collection keeps growing. Matching the retrieved documents
with the small set chosen by experts remains a challenge. On the other hand, there is still room to
improve how answers are generated from the retrieved documents. Therefore, in this work we focus
on improving the generation of ‘ideal’ and ‘exact’ answers in Phase B.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. Synergy</title>
        <p>
          For the Synergy challenge [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], we used the same methods as for the final submissions of the BioAsq12
competition, with the notable addition of a DeepSeek-R1 model variant for the generation of exact and
ideal answers [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The improved language generation skills of this model led to a marked improvement
in the free text required for the ideal answers.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Task 13b, phase A: Document Retrieval &amp; Snippet Identification</title>
        <p>
          In Phases A and A+ of the BioAsq challenge, the organizers release biomedical questions curated by
experts [
          <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
          ] that have to be processed within a strict 24-hour interval. For Phase A, participants
have to retrieve and submit up to 10 relevant documents per question, utilizing abstracts sourced from
the PubMed database. Based on the retrieved documents, participants must then identify and extract
the most relevant snippets.
        </p>
        <p>
          For document retrieval in Phase A, we adopted a standard approach shown to deliver strong
performance in previous work [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], specifically using the BM25 [
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
          ] ranking algorithm enhanced with
pseudo-relevance feedback from RM3 [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. From this setup, we initially retrieved the top 50 candidate
documents and subsequently re-ranked them based on the relevance of their associated snippets. Snippet
prediction, which extracts the most semantically relevant snippet from each of the top 10 retrieved
documents, is performed as described in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
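        <p>To make this retrieval setup concrete, the following minimal sketch shows BM25 retrieval with RM3
pseudo-relevance feedback using the Pyserini toolkit. The index path (a locally built Lucene index over
PubMed abstracts) and the parameter values are illustrative assumptions, not the exact settings of our
submitted system.</p>
        <preformat>
# Minimal sketch (Python): BM25 + RM3 retrieval of 50 candidate documents.
# Assumes a Lucene index over PubMed abstracts at 'indexes/pubmed'
# (hypothetical path); the parameters shown are common defaults, not our
# tuned values.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher('indexes/pubmed')
searcher.set_bm25(k1=0.9, b=0.4)              # BM25 ranking [13, 14]
searcher.set_rm3(fb_terms=10, fb_docs=10,     # RM3 pseudo-relevance feedback [15]
                 original_query_weight=0.5)

def retrieve_candidates(question, k=50):
    """Return (docid, score) pairs for the top-k candidates of a question."""
    hits = searcher.search(question, k=k)
    return [(hit.docid, hit.score) for hit in hits]
        </preformat>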
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Task13b, phase A+ / phase B: exact answer generation</title>
        <p>In Phase A+, participants submit exact and/or ideal answers before the expert-selected (gold)
documents and snippets (released in Phase B) are known. It serves as a baseline to compare with Phase
B, where feedback is provided to guide system improvement. Each participant must rely on their own
predicted documents and snippets for subsequent processing. They have 24 hours to submit their
results, which include documents, snippets, and exact and ‘ideal’ answers, based on the provided test set.
For document selection, we followed the same procedure as in Phase A. To generate the exact and ‘ideal’
answers, we used both the predicted snippets and the full abstracts as input.</p>
        <p>In Phase B, participants are required to submit exact answers for Yes/No, List, and Factoid questions,
as well as ideal answers for summary-type questions. This phase uses gold-standard documents and
snippets. In Phase B, we explored three distinct approaches for generating exact answers, as illustrated
in Figure 1.</p>
        <p>The first approach (Figure 1, method a.) utilizes the extracted snippets from the given (golden)
documents, incorporating them directly into the prompt to generate answers for each question. This
method has been used in our previous submissions and is commonly adopted by participants in the
BioAsq challenge. It is computationally efficient, as the snippets are typically short, ranging from a few
words to two sentences. The second approach (Figure 1, method b.) uses the full abstracts of the top 10
most relevant documents. The prompt is constructed by combining the question with these abstracts
in the following format: text = &lt;Abstract 1&gt;, &lt;Abstract 2&gt;, ..., &lt;Abstract 10&gt;, as
shown in Appendix A. This method provides broader contextual information and outperformed the
first approach in our evaluations.</p>
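        <p>A minimal sketch of this prompt construction, using the summary prompt template from Appendix A,
is shown below; the function and variable names are illustrative, not our exact code.</p>
        <preformat>
# Minimal sketch (Python): building a method-b prompt from the top-10
# abstracts, following the Summary Prompt of Appendix A.
SUMMARY_TEMPLATE = (
    "##ABSTRACT: %s ##QUESTION: %s ##TASK: Answer the QUESTION by returning "
    "a single paragraph sized text ideally summarizing only the most relevant "
    "information in the ABSTRACT."
)

def build_prompt(question, abstracts):
    """Concatenate the top-10 abstracts, separated by a single blank."""
    text = " ".join(abstracts[:10])
    return SUMMARY_TEMPLATE % (text, question)
        </preformat>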
        <p>The third approach (Figure 1, method c.) builds upon the second by additionally incorporating any
relevant documents identified during the extended document retrieval process in Phase A+. These
supplementary documents are appended to the original list, further enriching the input context provided
to the model and potentially including documents not selected by the experts who created the gold
answers.</p>
        <p>To improve the answer performance measures, we employed an LLM ’farming’ strategy, which
we initially implemented last year for Yes/No questions. This strategy utilizes a diverse ensemble of
complementary open-source large language models. In the present study, we extend this strategy to all
exact answer types, aggregating the union of answers from multiple LLMs for factoid and list questions.</p>
        <p>
          Using the BioAsq11 and BioAsq12 training sets, we evaluated 13 state-of-the-art LLMs using Ollama
[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and LM Studio [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], systematically analyzing their individual performance as well as all possible
combinations of models for each type of question. This experimentation allowed us to construct an
optimal ‘farm’ of models for each category of question. Due to the long run-time of these optimizations,
our submissions in the competition did not represent the best-performing system found. For all types
of questions, the optimization revealed novel combinations of models with higher performance than
any single LLM.
        </p>
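        <p>The exhaustive search over model combinations can be sketched as follows. Each model answers every
training question once through Ollama, and the 8191 subset evaluations then only re-combine the cached
outputs; the function names and the scoring stub are illustrative assumptions, not our exact code.</p>
        <preformat>
# Minimal sketch (Python): enumerate all 2^13 - 1 = 8191 non-empty LLM
# subsets and score each one on cached per-model answers.
from itertools import combinations

import ollama  # local inference server, https://github.com/ollama/ollama

MODELS = ['llama3.3', 'qwen3:14b', 'qwen3:32b']  # extended to all 13 models of Table 1

def ask(model, prompt):
    """Query one local model; in practice responses are cached per (model, question)."""
    resp = ollama.chat(model=model, messages=[{'role': 'user', 'content': prompt}])
    return resp['message']['content']

def best_subset(cached_answers, score):
    """score(list_of_answer_sets) -> float, e.g. MRR or F-Measure on the training set."""
    best_s, best_val = None, float('-inf')
    for r in range(1, len(MODELS) + 1):
        for subset in combinations(MODELS, r):
            val = score([cached_answers[m] for m in subset])
            if val > best_val:
                best_s, best_val = subset, val
    return best_s, best_val
        </preformat>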
        <sec id="sec-2-3-1">
          <title>2.3.1. Optimal factoid question answering subsets</title>
          <p>
            For the 13 LLMs listed in Table 1, there are 2^13 − 1 = 8191 different non-empty subsets S. The 13 LLMs
predict sets of factoids for all questions of BioAsq11 and BioAsq12 separately. All predictions are
compared to the golden answers of BioAsq11 and BioAsq12, respectively. The factoid sets of the LLMs
in S are combined to form a union for each question. Since the factoids should be ordered by relevance
and only the top 5 most relevant should be returned, the combination of the factoids considers the
relevance scores that are also returned by each LLM. The performance of these sets is evaluated with the
usual Mean Reciprocal Rank (MRR) measure, i.e. the mean over all questions of the reciprocal rank of
the first correct factoid. The MRR values, averaged over four rounds, for each subset S are shown in the
scatterplot in Figure 3, which visualizes the performances in BioAsq11 and BioAsq12. The color
of each dot indicates the size of the LLM set. Single LLMs are shown in red, the largest sets containing 6
LLMs are shown in blue. It should be noted that in all cases the sets with more than 6 LLMs had the same
performance as a ‘kernel’ set of 6 LLMs and are not contained in the plot. As all high-performing sets
occur in blue tones, it can be clearly seen that all red dots for single LLMs had worse performances than
any union of at least 4 LLMs (see also Figure 2). Consistently, the highest performances are obtained with
unions of 6 LLMs (DS-R1d-70B, llama3.3, qwen3-14b, qwen3-32b, reflection, smaug). These observations
can be made for both BioAsq11 and BioAsq12 independently, indicating no training set specificities. The
finding that larger unions give better results is very likely due to the complementarity of the answers
of the different LLMs. There can be many cases where one method finds a highly relevant factoid that
another method does not identify at all, or identifies only as a near miss. The merging strategy that uses
the confidence scores supports these situations.
          </p>
          <p>Model sources for the 13 LLMs of Table 1:
1. https://huggingface.co/mattshumer/ref_70_e3
2. https://ollama.com/library/llama3.1:70b
3. https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct
4. https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
5. https://huggingface.co/Qwen/Qwen3-14B
6. https://huggingface.co/Qwen/Qwen3-30B-A3B
7. https://huggingface.co/Qwen/Qwen3-32B
8. https://huggingface.co/01-ai/Yi-34B
9. https://huggingface.co/senseable/Smaug-72B-v0.1-gguf/blob/main/Smaug-72B-v0.1-q4_k_m.gguf
10. https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF
11. https://ollama.com/library/phi3:medium
12. https://ollama.com/library/phi4
13. https://ollama.com/library/aya:35b</p>
          <p><bold>Factoid deduplication.</bold> As forming the union might introduce multiple occurrences of exactly the
same factoid phrase or semantically similar phrases, we investigated a simple deduplication procedure.
Each factoid phrase is embedded with a standard transformer (all-MiniLM-L6-v2) and the cosine
similarity of the embeddings between all factoids is measured. With different thresholds for the cosine
similarity, semantically similar phrases can be removed from the set. The MRR performance with
different thresholds for the LLM subset with the best performance on BioAsq12 is shown in Figure 4.
It can be observed that deduplication does not improve MRR performance, consistent with our
observation that larger subsets in general have higher MRR performances.
          </p>
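          <p>A minimal sketch of this deduplication step, assuming the sentence-transformers implementation of
all-MiniLM-L6-v2, is given below; it keeps the highest-confidence occurrence of each group of
near-duplicate factoids, with the cosine-similarity threshold as the tunable parameter studied in Figure 4.</p>
          <preformat>
# Minimal sketch (Python): embedding-based deduplication of a merged factoid
# list. Assumes the factoids are already sorted by decreasing confidence.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer('all-MiniLM-L6-v2')

def deduplicate(factoids, threshold):
    """Drop factoids too similar to an earlier (higher-confidence) factoid."""
    if not factoids:
        return []
    emb = encoder.encode(factoids, convert_to_tensor=True)
    sim = util.cos_sim(emb, emb)
    kept = []
    for i in range(len(factoids)):
        if all(float(sim[i][j]) &lt; threshold for j in kept):
            kept.append(i)
    return [factoids[k] for k in kept]
          </preformat>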
        </sec>
        <sec id="sec-2-3-2">
          <title>2.3.2. Optimal list question answering subsets</title>
          <p>A procedure similar to the processing of the factoid questions is used for list type questions. As no
relevance order and no limit is required for the list items in the answer, the set of list items is the simple
union of the list items predicted by each LLM in the subset S. The usual performance measure for list
type questions is the F-Measure, which is the harmonic mean of precision and recall. With a growing
number of list items in the union for each additional LLM in the subset S, the chance of false positive
items increases and precision decreases. In the scatterplot showing the F-Measure scores (averaged
over four rounds) for BioAsq11 and BioAsq12 in Figure 6, it can be clearly seen that the large subsets
with more than 7 LLMs have significantly lower performance than the smaller subsets, independently
of the BioAsq dataset used. However, as detailed in Figure 5, several specific combinations, such as the
set [DS-R1d-70B, L3.3, Qwen14], have better performance than any of the single LLMs.</p>
          <p><bold>List deduplication.</bold> The same deduplication procedure used for factoids is also evaluated for the union
of list items. For the subset with the best performance on the BioAsq12 set, the F-Measure performance
for different thresholds for the cosine similarity is shown in Figure 7. It can be observed that two levels
of deduplication achieve higher performance than no deduplication, with an optimal F-Measure
value at a threshold of 0.76. A threshold of 0.7 was used for the list type submission in the BioAsq13
competition. Compared to the deduplication results for factoid questions, we confirm the general effect
that deduplication improves the F-Measure (used to evaluate list questions) by increasing precision
without harming recall, but it does not change the Mean Reciprocal Rank (MRR, used to evaluate factoid
questions) because the position of the first correct result usually remains unchanged.</p>
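          <p>For reference, the per-question F-Measure used here can be sketched as below; matching of predicted
to golden list items is simplified to normalized exact string comparison, whereas the official evaluation
is more permissive.</p>
          <preformat>
# Minimal sketch (Python): precision, recall and F-Measure for one list question.
def list_f_measure(predicted, golden):
    """F-Measure of the (deduplicated) union of list items against the gold list."""
    pred = {p.strip().lower() for p in predicted}
    gold = {g.strip().lower() for g in golden}
    if not pred or not gold:
        return 0.0
    tp = len(pred &amp; gold)            # true positives
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)
          </preformat>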
        </sec>
        <sec id="sec-2-3-3">
          <title>2.3.3. Optimal jury sets answering Yes/No questions</title>
          <p>
            The concept of using a jury (or ‘farm’) of LLMs was introduced by our lab for BioAsq12 [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ]. Here
we further optimize this by evaluating all possible combinations of LLMs and adding more recent
LLMs. A subset S of LLMs generates an answer by counting the number of ‘Yes’ and ‘No’ outcomes
across the participating LLMs. The final answer is ‘Yes’ if there are at least as many ‘Yes’ outcomes
as ‘No’ outcomes. The performances of the different subsets under the usual macro F1
measure (averaged over four rounds) are shown for BioAsq11 and BioAsq12 in Figure 9. The discrete
nature of this question type leads to more discrete performance levels, which are visualized in the plot by
applying a small jitter. As in the case of the list type questions, it can be seen that there are several
combinations of a few LLMs, such as [Aya, Qwen32, Smaug], that outperform any individual LLM.
          </p>
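          <p>The jury decision itself is a simple majority vote with ties resolved to ‘Yes’, as sketched below.</p>
          <preformat>
# Minimal sketch (Python): majority vote of a jury of LLM answers to a
# Yes/No question; ties are resolved in favour of 'Yes', as described above.
def jury_vote(answers):
    yes = sum(1 for a in answers if a.strip().lower().startswith('yes'))
    no = len(answers) - yes
    return 'yes' if yes >= no else 'no'

# Example: the jury [Aya, Qwen3-32B, Smaug] answering ['yes', 'no', 'yes']
# yields 'yes'.
          </preformat>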
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>
        Here, we present the results across all our participations (Synergy and Phase B) in the BioAsq competition.
All systems submitted throughout the different phases are listed under the name Fleming-X in the
results. The evaluation of systems participating in the BioAsq competition Task B varies [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] based on
the question type.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Synergy Results</title>
        <p>In the four rounds of the Synergy 2025 BioAsq challenge, our system achieved notable results: first
place in round 2 for ’ideal answers’ and second place in rounds 3 and 4.</p>
        <p>As shown in Table 2, the evaluation for ‘Exact’ answers includes multiple measures across three question
types: Accuracy and Macro F1 for Yes/No questions, Mean Reciprocal Rank (MRR) and Lenient Accuracy
for Factoid questions, and Mean Precision and F-measure for List questions. The overall position per
system in each batch is calculated based on a combination of Macro F1 (Yes/No), MRR (Factoid), and
F-measure (List). While the top-ranked systems achieve the highest combined scores, the Fleming
submissions stand out with stronger results in the List category and competitive performance in Factoid.
Table 3 presents the performance of systems on the ‘Ideal answers’ task, as assessed through manual
evaluation. The scores reflect human judgments across four criteria: Readability, Recall, Precision, and
Repetition, with the final Mean Manual score representing their average. In Batch 2, the Fleming system
achieved the highest overall score, ranking 1st, while in Batches 3 and 4, it remained competitive with
particularly strong Recall and Repetition scores, securing 2nd and 4th place respectively.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Task 13b: Phase A</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Document retrieval</title>
          <p>Table 4 lists the preliminary performances of our document retrieval submissions for the BioAsq13
competition. The final and official results will be available shortly before the BioAsq13
workshop, after the manual assessment of all system responses by the BioAsq experts and the enrichment
of the respective ground truth with potential additional correct elements.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Task 13b: Phase A+ and Phase B</title>
        <sec id="sec-3-2-1">
          <title>3.3.1. Exact answer prediction</title>
          <p>The tables reporting the Phase A+ (Table 5) and Phase B (Table 6) results of BioAsq13 for exact answers
provide a comparative view of our submitted systems. In each batch, the first row corresponds to the
top-ranked competitor. For each question type, we report a corresponding evaluation metric: Macro F1
for Yes/No, MRR for Factoid, and F-Measure for List. The systems are ranked per metric and the total
rank is computed as the sum of these individual ranks, providing an overall measure of performance
across all question types. The final position according to the total rank and the total number of
submissions is indicated in the column ‘Position’. Our systems demonstrated competitive performance,
particularly in the Yes/No and Factoid categories.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.3.2. Ideal answer prediction</title>
          <p>Regarding the evaluation of the ideal answer for both Phase A+ and Phase B of Task 13b, we are currently
waiting for the release of the scores manually assigned by the BioAsq experts, which are expected to be
published shortly before the CLEF workshop in September. We note that all results for Task 13b remain
provisional, as small corrections may still be applied by question curators prior to the workshop.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future Work</title>
      <p>
        In this work, we presented a robust and extensible methodology for biomedical question answering
within the BioAsq challenge framework. A key innovation in our methodology is the application and
generalization of an LLM ‘farming’ strategy, initially developed for Yes/No questions [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], to all exact
question types. By systematically evaluating 13 state-of-the-art open LLMs and exhaustively testing
all possible model combinations, we created optimized model farms for Yes/No, factoid, and list type
questions. Our results show that combining multiple models improves performance in each case. For
Factoid questions, the best results came from combining six different LLMs. This pattern was consistent
across both the BioAsq11 and BioAsq12 datasets, suggesting that the improvement was not specific
to the training data but rather due to the different strengths of each model working together. The
top-performing combinations are shown in Table 7. For List questions, using too many models actually
reduced performance. The best results came from small groups of about three models, for example,
[DS-R1d-70B, L3.3, Qwen14], which outperformed all single models. For Yes/No questions, smaller
combinations also worked best. Groups of three to four models, like the jury of [Aya, Qwen32, Smaug],
outperformed individual models. In summary, combining LLMs can improve performance, but the
optimal number of models depends on the question type.
      </p>
      <p>Moving forward, we plan to expand our evaluation to include more state-of-the-art open-source LLMs,
incorporate more confidence scoring mechanisms across model outputs to better weigh and reconcile
conflicting answers, and release our question answering system to support reproducibility and foster
collaboration in the open LLM community.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We thank the anonymous reviewers for their valuable questions and comments. We express our gratitude
to the BioAsq challenge organizers for organizing the event and offering continuous support. The GPU
computations were executed on two servers acquired as part of project ID 16624, titled "Creation
Expansion - Upgrading of the Infrastructures of research centers supervised by the General Secretariat
for Research and Innovation (GSRI)" with the code MIS 5161770. This project received funding under
the "National Recovery and Resilience Plan Greece 2.0".</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT, Gemini, and Copilot for grammar
and spelling checks, paraphrasing, and rewording. After using these services, the authors reviewed and edited
the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Appendix</title>
      <p>In all these prompts, the %s after QUESTION is replaced by the actual question, and the %s after
INFORMATION, TEXT or ABSTRACT is replaced with the collection of the related snippets or abstracts,
concatenated and separated by a single blank.</p>
      <p><bold>Yes/No Prompt</bold></p>
      <preformat>Given only the following INFORMATION and QUESTION, answer the QUESTION only with
’Yes’ or ’No’. Think carefully. INFORMATION: %s QUESTION: %s</preformat>
      <p><bold>Factoid Prompt</bold></p>
      <preformat>Answer the QUESTION using only the TEXT by only returning a list of entity names, numbers,
or similar short expressions that are an answer to the question and are separated by commas.
Only the list should be returned. If you do not know any answer return the word EMPTY. TEXT:
%s QUESTION: %s</preformat>
      <p>A variant of this prompt additionally requests ordering by decreasing confidence:</p>
      <preformat>Answer the QUESTION using only the TEXT by only returning a list of entity names,
numbers, or similar short expressions that are an answer to the question and are separated by
commas, ordered by decreasing confidence. Only the list should be returned. If you do not know
any answer return the word EMPTY. TEXT: %s QUESTION: %s</preformat>
      <p><bold>Summary Prompt</bold></p>
      <preformat>##ABSTRACT: %s ##QUESTION: %s ##TASK: Answer the QUESTION by returning a single
paragraph sized text ideally summarizing only the most relevant information in the ABSTRACT.</preformat>
      <table-wrap id="tbl7">
        <label>Table 7</label>
        <caption><p>Best-performing LLM combinations per question type.</p></caption>
        <table>
          <thead>
            <tr><th>Question type</th><th>Best Model(s)</th><th># of LLMs</th></tr>
          </thead>
          <tbody>
            <tr><td>Factoid</td><td>DeepSeek-R1-Distill-Llama-70B, LLaMA 3.3, Qwen3-14B, Qwen3-32B, Reflection, Smaug</td><td>6</td></tr>
            <tr><td>List</td><td>DeepSeek-R1-Distill-Llama-70B, LLaMA 3.3, Qwen3-14B</td><td>3</td></tr>
            <tr><td>Yes/No</td><td>Aya, Qwen3-32B, Smaug</td><td>3</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>OpenAI</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Achiam</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Adler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Akkaya</surname>
            ,
            <given-names>F. L.</given-names>
          </string-name>
          <string-name>
            <surname>Aleman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Almeida</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Altenschmidt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Altman</surname>
          </string-name>
          , et al. (
          <volume>281</volume>
          authors),
          <source>Gpt-4 technical report</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/ abs/2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mandikal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mooney</surname>
          </string-name>
          ,
          <article-title>Sparse meets dense: A hybrid approach to enhance scientific document retrieval</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2401.04055.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Biobert: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <year>2019</year>
          )
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          . URL: https://doi.org/10.1093/bioinformatics/btz682.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          , H. Cheng, M. Lucas,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <article-title>Domain-specific language model pretraining for biomedical natural language processing</article-title>
          ,
          <year>2020</year>
          . arXiv:arXiv:
          <year>2007</year>
          .15779.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ateia</surname>
          </string-name>
          , U. Kruschwitz,
          <article-title>Is chatgpt a biomedical expert? - exploring the zero-shot performance of current gpt models in biomedical tasks</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2306.16108. arXiv:
          <volume>2306</volume>
          .
          <fpage>16108</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Reczko</surname>
          </string-name>
          , Electrolbert:
          <article-title>Combining replaced token detection and sentence order prediction</article-title>
          .,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>335</fpage>
          -
          <lpage>340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Panou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reczko</surname>
          </string-name>
          ,
          <article-title>Semi-supervised training for biomedical question answering</article-title>
          .,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>152</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Panou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reczko</surname>
          </string-name>
          ,
          <article-title>Farming open llms for biomedical question answering</article-title>
          ,
          <source>CLEF Working Notes</source>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.5281/zenodo.13683433.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Paliouras, Overview of BioASQ Tasks 13b and Synergy13 in CLEF2025</article-title>
          , in: CLEF 2025 Working Notes,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>DeepSeek-R1-Distill-Llama-</surname>
          </string-name>
          70B
          <string-name>
            <surname>-GGUF</surname>
          </string-name>
          ,
          <year>Huggingface</year>
          ,
          <year>2025</year>
          . URL: https://huggingface.co/ second-state/DeepSeek-R1
          <string-name>
            <surname>-Distill-Llama-</surname>
          </string-name>
          70B
          <string-name>
            <surname>-GGUF</surname>
          </string-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. N. Maria</given-names>
            <surname>Di</surname>
          </string-name>
          <string-name>
            <surname>Nunzio</surname>
          </string-name>
          , Giorgio,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <source>Overview of BioASQ</source>
          <year>2025</year>
          :
          <article-title>The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Sixteenth International Conference of the CLEF Association</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          , E. Tutubalina, G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          , et al.,
          <source>BioASQ</source>
          at CLEF2025:
          <article-title>The Thirteenth Edition of the Large-Scale Biomedical Semantic Indexing and Question Answering Challenge</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>407</fpage>
          -
          <lpage>415</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <article-title>The probabilistic relevance framework: Bm25 and beyond</article-title>
          ,
          <source>Found. Trends Inf. Retr</source>
          .
          <volume>3</volume>
          (
          <year>2009</year>
          )
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          . URL: https://doi.org/10.1561/1500000019.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beaulieu</surname>
          </string-name>
          ,
          <article-title>Experimentation as a way of life: Okapi at trec</article-title>
          ,
          <source>Information Processing Management</source>
          <volume>36</volume>
          (
          <year>2000</year>
          )
          <fpage>95</fpage>
          -
          <lpage>108</lpage>
          . URL: https://www.sciencedirect.com/science/article/ pii/S0306457399000461.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lavrenko</surname>
          </string-name>
          , W. B.
          <string-name>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Relevance based language models</article-title>
          ,
          <source>in: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '01,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2001</year>
          , p.
          <fpage>120</fpage>
          -
          <lpage>127</lpage>
          . URL: https://doi.org/10.1145/383952.383972. doi:
          <volume>10</volume>
          .1145/383952.383972.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Ollama</surname>
          </string-name>
          , github,
          <year>2024</year>
          . URL: https://github.com/ollama/ollama.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>L. AI</surname>
          </string-name>
          ,
          <article-title>Lightning studio (lm studio)</article-title>
          , https://lightning.ai/docs/studio/,
          <year>2023</year>
          . Software.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Androutsopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <article-title>Evaluation measures for task b</article-title>
          ,
          <string-name>
            <surname>BioASQ-EvalMeasures-taskB</surname>
          </string-name>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>Yes/No Best Model(s) DeepSeek-R1-</article-title>
          <string-name>
            <surname>Distill-Llama-</surname>
          </string-name>
          70B
          <source>, LLaMA 3</source>
          .3,
          <fpage>Qwen3</fpage>
          -14B,
          <fpage>Qwen3</fpage>
          - 32B, Reflection, Smaug
          <string-name>
            <surname>DeepSeek-R1-Distill-Llama-</surname>
          </string-name>
          70B
          <source>, LLaMA 3</source>
          .3,
          <fpage>Qwen3</fpage>
          -14B
          <string-name>
            <surname>Aya</surname>
          </string-name>
          ,
          <fpage>Qwen3</fpage>
          -32B, Smaug
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>