<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Are Smaller Open-Weight LLMs Closing the Gap to Proprietary Models for Biomedical Question Answering?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Damian Stachura</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joanna Konieczna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Artur Nowak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Evidence Prime</institution>
          ,
          <addr-line>Krakow</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Open-weight versions of large language models (LLMs) are rapidly advancing, with state-of-the-art models like DeepSeek-V3 now performing comparably to proprietary LLMs. This progression raises the question of whether small open-weight LLMs are capable of effectively replacing larger closed-source models. We are particularly interested in the context of biomedical question-answering, a domain we explored by participating in Task 13B Phase B of the BioASQ challenge. In this work, we compare several open-weight models against top-performing systems such as GPT-4o, GPT-4.1, Claude 3.5 Sonnet, and Claude 3.7 Sonnet. To enhance question answering capabilities, we use various techniques including retrieving the most relevant snippets based on embedding distance, in-context learning, and structured outputs. For certain submissions, we utilize ensemble approaches to leverage the diverse outputs generated by different models for exact-answer questions. Our results demonstrate that open-weight LLMs are comparable to proprietary ones. In some instances, open-weight LLMs even surpassed their closed counterparts, particularly when ensembling strategies were applied. All code is publicly available at https://github.com/evidenceprime/BioASQ-13b.</p>
      </abstract>
      <kwd-group>
        <kwd>Biomedical Question Answering</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Zero-Shot Prompting</kwd>
        <kwd>Few-Shot Prompting</kwd>
        <kwd>In-Context Learning</kwd>
        <kwd>GPT-4</kwd>
        <kwd>Claude</kwd>
        <kwd>Open-Weight LLM</kwd>
        <kwd>Ensembling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In question answering tasks, access to domain-specific knowledge can significantly enhance response
quality, particularly when answers need to be grounded in provided supplementary materials.</p>
      <p>
        The BioASQ Challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] exemplifies such a task in the biomedical domain. In this challenge,
participating systems are provided with relevant biomedical papers from the PubMed database. These
materials can then be leveraged to generate high-quality responses to the posed questions. The questions
themselves span four distinct types: yes/no, factoid, list, and summary questions.
      </p>
      <p>
        Over the years, various approaches have been employed for question answering tasks in the BioASQ
Challenge. In its earliest editions, classic methods were applied, such as BM25, which ranked retrieved
documents based on their relevance to a question. Additionally, researchers applied similarity algorithms
using vector embeddings, algorithms based on linguistic annotations of texts, and early deep learning
models. For an extended period, BERT-based solutions, notably BioBERT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and PubMedBERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
dominated these question answering tasks. Sequence-to-sequence models like T5 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] also proved to be
useful.
      </p>
      <p>
        However, since the global emergence of ChatGPT [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] in 2022, large language models have
significantly reshaped the competitive landscape of question answering. Proprietary models, including
OpenAI’s offerings, Gemini [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and Claude [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], initially dominated this field, fostering the belief that
creating robust LLMs requires a massive financial investment.
      </p>
      <p>
        A significant shift began in 2024 and 2025, with a growing number of organizations publicly releasing
models featuring open weights and permissive licenses. Today, large open-weight models like Llama
3-405B [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], DeepSeek-V3 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and Qwen3-235B-A22B [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] are proving capable of challenging even the
best proprietary models, as evidenced by platforms like the LM Arena [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Perhaps even more impactful
are the smaller open-weight models, which can run on consumer-grade machines and demonstrate
impressive competitiveness in tasks requiring access to domain-specific knowledge, especially within a
retrieval-augmented generation (RAG) setup.
      </p>
      <p>
        This year marks the 13th edition of the BioASQ Challenge, and we participated in Task 13B, Phase B
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Our primary objective was to investigate whether relatively small LLMs, primarily those up to 14
billion parameters, could effectively compete with more powerful proprietary models in this biomedical
question-answering context. To achieve this, we explored multiple strategies for enhancing our results.
We utilized in-context learning by leveraging the provided database of questions from previous BioASQ
challenge editions. Additionally, we applied similarity search over vector embeddings to select a
pertinent subset of snippets from the provided PubMed articles for each question. This paper details
the results from four question batches of Task 13B Phase B and discusses our conclusions.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>We experimented with numerous techniques to optimize performance in biomedical question answering.
The successful approaches implemented in our solutions are outlined below.</p>
      <sec id="sec-2-1">
        <title>2.1. Best Snippets Selection</title>
        <p>
          For our submissions, we selected the 10 best-matching snippets from the provided PubMed articles.
Our team experimented with varying snippet counts, validating them against datasets from previous
BioASQ challenge versions. This process led us to select 10 snippets as the optimal number, as it
consistently produced the most robust results. We utilized the sentence-transformers library [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
with the nomic-embed-text-v1 [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] model. Our approach involved computing embeddings for all
snippets and the question. Subsequently, we calculated the cosine similarity between each (snippet,
question) pair to identify options with the highest similarity. Finally, these selected snippets were
provided to the model, ordered from most to least similar.
        </p>
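<p>The ranking step above can be sketched as follows. This is a minimal illustration that assumes precomputed embedding vectors (the submissions computed them with sentence-transformers and nomic-embed-text-v1); the function and variable names are ours, not taken from the submission code.</p>

```python
import numpy as np

def top_k_snippets(question_emb, snippet_embs, snippets, k=10):
    """Rank snippets by cosine similarity to the question embedding.

    Returns the k best-matching snippets, ordered from most to least similar.
    """
    q = question_emb / np.linalg.norm(question_emb)
    s = snippet_embs / np.linalg.norm(snippet_embs, axis=1, keepdims=True)
    sims = s @ q  # cosine similarity of each snippet to the question
    order = np.argsort(-sims)[:k]
    return [snippets[i] for i in order]
```

<p>Because both vectors are L2-normalized first, the dot product equals the cosine similarity between each (snippet, question) pair.</p>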
      </sec>
      <sec id="sec-2-2">
        <title>2.2. In-Context Learning</title>
        <p>
          Research has demonstrated that in-context learning enhances the performance of language models in
diverse applications [
          <xref ref-type="bibr" rid="ref16 ref5">5, 16</xref>
          ]. We investigated how different models performed with in-context learning,
sourcing examples from previous BioASQ challenge editions. We used Qdrant [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], a vector database,
into which we inserted computed embeddings for all previous questions combined with their 10 best
snippets. Subsequently, for each new question, we queried the database for the most similar elements,
following the approach presented in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Experimentally, we determined that 3 examples were optimal
for factoid and list questions, while a zero-shot approach was used for yes/no and summary type
questions.
        </p>
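<p>The example-retrieval step can be sketched in the same spirit. The brute-force cosine search below stands in for the Qdrant lookup used in the actual system, and the QUESTION/ANSWER formatting is an illustrative assumption rather than the submitted prompt layout.</p>

```python
import numpy as np

def few_shot_examples(question_emb, bank_embs, bank, n_shots=3):
    """Retrieve the n_shots most similar past questions as in-context examples.

    `bank` holds dicts like {"question": ..., "answer": ...} from earlier BioASQ
    editions; the real system queried a Qdrant collection of the same embeddings.
    """
    q = question_emb / np.linalg.norm(question_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    best = np.argsort(-(b @ q))[:n_shots]  # nearest neighbours first
    return "\n\n".join(
        f"QUESTION: {bank[i]['question']}\nANSWER: {bank[i]['answer']}" for i in best
    )
```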
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Prompts</title>
        <p>
          For all question types, we utilized hand-crafted prompts. As noted previously, a zero-shot prompting
approach was employed for yes/no and summary questions, given empirical observations that few-shot
prompting detrimentally affected performance for these specific categories. Conversely, for factoid
and list questions, few-shot prompting demonstrated clear benefits. Accordingly, we implemented
a 3-shot prompting strategy for these questions, based on insights gained through comprehensive
experimentation. The system prompts are detailed in Table 1, and the actual prompts guiding the
models to generate answers for all question types are presented in Table 2. We also briefly experimented
with DSPy [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] for the automated generation of prompts based on predefined input and output schemas
in batch 2. However, responses achieved using DSPy were slightly worse than those achieved with our
hand-crafted prompts. In the future, we plan to investigate whether automatic prompt optimization can
help by creating model-specific prompts.
        </p>
        <p>Excerpts from the prompts in Tables 1 and 2:</p>
        <p>You are an expert in the medical texts summarization.</p>
        <p>Answer the given question with a single paragraph
text and your answer should be based on the provided
context snippets. You should generate your response
in at most 2-3 sentences (30-50 words).</p>
        <p>Given only the following SNIPPETS and QUESTION, answer
the QUESTION only with ’Yes’ or ’No’.</p>
        <p>Extract key biomedical entities **strictly using the provided
SNIPPETS** to answer the QUESTION. List **1 to 5** of the
most relevant entities, ranked by confidence. **Never exceed 5
entities.** If more exist, return only the top 5. Prefer concise
entities and **remove redundant or longer variants** of the
same term. If no relevant entities exist, return ‘None.‘.</p>
        <p>Extract key biomedical entities **strictly using the provided
SNIPPETS** to answer the QUESTION. List **1 to 5** of the
most relevant entities. Prefer concise entities and **remove
redundant or longer variants** of the same term. If no relevant
entities exist, return ‘None‘.</p>
        <p>Answer the QUESTION by returning a single paragraph sized
text (use max 50 words) ideally summarizing only the most
relevant information in the SNIPPETS.</p>
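<p>The per-type prompting policy (zero-shot for yes/no and summary, 3-shot for factoid and list) can be sketched as message assembly; the system text and field names below are placeholders, not the submitted prompts from Tables 1 and 2.</p>

```python
def build_messages(question, snippets, qtype, examples=()):
    """Assemble chat messages: zero-shot for yes/no and summary questions,
    up to 3-shot for factoid and list questions.
    """
    context = "\n".join(f"- {s}" for s in snippets)
    messages = [{"role": "system", "content": "You are a biomedical QA assistant."}]
    if qtype in ("factoid", "list"):
        # Few-shot: prepend up to 3 retrieved (question, answer) examples.
        for ex in examples[:3]:
            messages.append({"role": "user", "content": ex["question"]})
            messages.append({"role": "assistant", "content": ex["answer"]})
    messages.append(
        {"role": "user", "content": f"SNIPPETS:\n{context}\n\nQUESTION: {question}"}
    )
    return messages
```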
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Structured Outputs</title>
        <p>
          We opted to use structured outputs to facilitate the extraction of LLM results in a predefined format.
We defined a JSON schema for response formatting, and subsequently followed a context-free grammar
(CFG) approach for it. We used CFG implementations introduced by model providers or accessed via
external libraries like Outlines [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] to guide the token sampling process. This methodology ensures
that the generated tokens adhere strictly to the schema, eliminating the need for complex regex-based
extraction from the model response.
        </p>
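<p>As a minimal sketch, the schema below illustrates the kind of JSON constraint involved; the actual schemas and the CFG-guided decoding (via provider APIs or Outlines) are not reproduced here, and this helper only checks conformance after the fact.</p>

```python
import json

# Hypothetical schema for exact answers: with structured outputs, a schema of
# this shape guides token sampling so the model can only emit conforming JSON.
LIST_ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "entities": {"type": "array", "items": {"type": "string"}, "maxItems": 5},
    },
    "required": ["entities"],
}

def parse_list_answer(raw):
    """Parse a response constrained to LIST_ANSWER_SCHEMA.

    With CFG-guided decoding the JSON is guaranteed to parse; the checks here
    are a post-hoc sanity sketch, not the decoding-time constraint itself.
    """
    data = json.loads(raw)
    entities = data["entities"]
    assert isinstance(entities, list) and len(entities) <= 5
    return entities
```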
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Models</title>
        <p>
          In our study, we utilized various LLMs, drawing from both open-weight and closed options. Our primary
focus was on relatively smaller open-weight models, specifically those with up to 14 billion parameters,
such as Phi-4 [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], Gemma-3-12B [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], Qwen2.5 14B [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], and Meditron Phi-4 14B [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. For a third
batch of experiments, we expanded our testing to include quantized versions of Gemma-3-27B [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] and
Mistral3-24B [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Although we briefly attempted to use HuatuoGPT-o1 [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], our limited exploration
of reasoning models meant we did not achieve strong results with it. We also incorporated several of
the newest closed-source models, including recent generations of GPT (GPT-4o [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], GPT-4.1 [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]) and
Claude (Claude Sonnet 3.5 [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], Claude Sonnet 3.7 [30]).
        </p>
        <sec id="sec-2-5-1">
          <title>2.5.1. Quantized Models</title>
          <p>For batches 1-3, our experiments used open-weight models quantized to 4-bit. However, in the final
batch, we proposed solutions based on the full, unquantized versions of these models. Interestingly, in
both setups, the open-weight models proved to be competitive with the closed alternatives.</p>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Ensembling Method</title>
        <p>Ensembling methods are widely recognized as beneficial when combining responses from multiple
weaker models to achieve a single, stronger result. Various techniques exist for this purpose, including
majority voting, confidence scoring, and aggregation.</p>
        <p>For yes/no questions, we applied a straightforward majority voting approach, where the final answer
was determined by the option chosen by the most models. For factoid and list questions, we developed
a more sophisticated aggregation method. This involved collecting responses from all models and
calculating the frequency of each distinct response. The most frequent response was then selected,
provided its number of appearances exceeded a predefined threshold. The process is visualized on
Figure 1. For factoid questions, we limited the output list to a maximum of five best responses, as
specified in the rules for this question type.</p>
        <p>For later batches, we incorporated different classes of LLMs into our ensembling strategy. This
approach leveraged the observed benefits of integrating the diverse characteristics of various LLM
families such as Phi, Qwen, Mistral, and Gemma. As emphasized by Jiang et al. [31], diferent LLMs,
trained on varied data and architectures, inherently exhibit unique strengths that can be synergistic in
an ensemble.</p>
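<p>The two exact-answer ensembling rules described above can be sketched as follows; the aggregation threshold value here is illustrative, not the one used in the submissions.</p>

```python
from collections import Counter

def majority_vote(answers):
    """Yes/no ensembling: return the option chosen by the most models."""
    return Counter(a.strip().lower() for a in answers).most_common(1)[0][0]

def aggregate_entities(model_outputs, threshold=2, max_items=5):
    """Factoid/list ensembling: count how often each distinct entity is
    proposed across models, keep those whose frequency meets the threshold
    (most frequent first), and cap the list at five per the factoid rules.
    """
    counts = Counter(e.strip().lower() for out in model_outputs for e in out)
    ranked = [entity for entity, c in counts.most_common() if c >= threshold]
    return ranked[:max_items]
```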
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>
        We participated solely in Task 13B, Phase B, of the BioASQ challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ][
        <xref ref-type="bibr" rid="ref13">13</xref>
          ]. This task encompasses
four question types. Yes/no, factoid, and list questions are evaluated by matching exact
answers against those provided by the challenge organizers. In addition, summary questions require an
ideal, free-form summary as a response, scored through automatic metrics and manual reviews. Our
approach involved testing multiple techniques and models across published question batches, leading
to distinct strategies for each.
      </p>
      <sec id="sec-3-1">
        <title>3.1. System Definitions</title>
        <p>For each submission, the procedures detailed in Section 2 were applied. The models used to generate responses in
each submission, grouped by system name (EP-1 to EP-5), are presented below. The results for these
systems in all batches are summarized in Table 3, Table 4, Table 5, and Table 6. A detailed specification
of each submitted system follows:</p>
        <sec id="sec-3-1-1">
          <title>Batch 1:</title>
          <p>• EP-1: Phi-4
• EP-2: HuatuoGPT-o1 8B
• EP-3: Qwen2.5-14B
• EP-4: GPT-4o
• EP-5: Claude 3.5 Sonnet</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Batch 2:</title>
          <p>• EP-1: Ensemble - Gemma-3-12B, Qwen2.5-14B, Phi-4, GPT-4o, Claude 3.5 Sonnet
• EP-2: Claude 3.5 Sonnet
• EP-3: Phi-4
• EP-4: Phi-4 + DSPy prompt (only for factoid questions, without prompt optimization)
• EP-5: Qwen2.5-14B + DSPy prompt (only for factoid questions, without prompt optimization)</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>Batch 3:</title>
          <p>• EP-1: Ensemble - Mistral-Small-3.1-24B, Gemma-3-12B, Gemma-3-27B, Qwen2.5-14B, Phi-4
• EP-2: Ensemble - GPT-4o, GPT-4.1, Claude 3.5 Sonnet
• EP-3: GPT-4.1
• EP-4: Phi-4</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>Batch 4:</title>
          <p>• EP-1: Ensemble - GPT-4.1, GPT-4o, Claude 3.5 Sonnet, Claude 3.7 Sonnet
• EP-2: Ensemble - Gemma-3-12B, Qwen2.5-14B, Meditron3-Phi4-14B, Phi-4
• EP-3: Ensemble - Qwen2.5-14B, Meditron3-Phi4-14B, Phi-4, GPT-4.1, GPT-4o, Claude 3.5 Sonnet, Claude 3.7 Sonnet
• EP-4: Ensemble - Qwen2.5-14B, Phi-4, GPT-4.1, GPT-4o, Claude 3.7 Sonnet
• EP-5: GPT-4.1</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Exact Answers</title>
        <p>We used distinct answering strategies for each batch, as detailed in Section 3.1. For batch 1, our primary focus
was on assessing the performance of individual models within each system. In batches 2 and 3, we also
incorporated ensembling techniques, specifically involving combinations of open-weight and selected
closed models. It is important to note that for these batches, we only used quantized versions of open
models. Finally, for the last batch, we conducted a comparative analysis between full open-weight
models and proprietary models.</p>
        <p>The most advantageous approach for yes/no questions was difficult to determine. Proprietary models
achieved the best results in batches 1 and 4, while open-weight models dominated in the remaining
batches, as shown in Table 3. The quality of the provided context appears to be a critical factor, with both model
types demonstrating sufficient capability to extract key information pertinent to the question.</p>
        <p>List-based questions demonstrably pose a greater challenge for LLMs. Despite this, open-weight
models performed competitively with proprietary ones. Furthermore, we found that ensembling more
diverse models leads to improved scores.</p>
        <p>In more detail, ensembling a mixture of open and closed models proved beneficial for factoid
questions. In batch 2, single proprietary models were outperformed by such a mixed ensemble.
This solution also represented the best approach in batch 4, surpassing individual closed models and
ensembles composed solely of open-weight or closed models. In batch 3, multiple solution types
exhibited competitive performance. Table 4 provides a summary of the results for factoid questions
across all batches.</p>
        <p>These insights are further corroborated by the findings from list questions, presented in Table 5.
Ensembling solely open-weight models or a mixture of both model types consistently yielded the best
approaches in batches 2, 3, and 4. For batch 1, the results between both model types were notably
similar. These observations strongly suggest that ensembles of open-weight models can address more
challenging tasks at a comparable, or even superior, level to closed models.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Ideal Answers</title>
        <p>
          For the summary questions, we directly generated each summary using a chosen LLM and the prompt
detailed in Table 2, without employing ensembling techniques. For systems involving multiple LLMs,
we generated candidate summaries from all participating models for each question. The best summary
was then selected using a cross-encoder reranking approach. This method involved calculating the
similarity score between each generated summary and its corresponding question. The summary with
the highest score was subsequently selected. For this purpose, we used the BiomedBERT model [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to
compute these similarity measures.
        </p>
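<p>The reranking step above can be sketched as follows, with a generic score_fn standing in for the BiomedBERT-based cross-encoder used in the submissions.</p>

```python
def pick_best_summary(question, candidate_summaries, score_fn):
    """Cross-encoder reranking sketch: score each (question, summary) pair
    and return the highest-scoring candidate summary.

    `score_fn` is any callable returning a relevance score for a pair; in the
    actual system this was computed with a BiomedBERT cross-encoder.
    """
    return max(candidate_summaries, key=lambda s: score_fn(question, s))
```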
        <p>The recall scores for ROUGE metrics [32] achieved by our method for batches 2 and 4 were comparable
to those of the top-performing solutions. However, F1 scores for ROUGE metrics in these batches were
significantly lower. The results for summary questions can be seen in Table 6.</p>
        <p>As shown in Table 6, Phi-4 exhibited the strongest performance among the evaluated LLMs, suggesting
that open-weight models can also be competitive for summary-based questions. However, a full analysis
of the responses to these questions requires manual scores that have not yet been published.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>Our primary goal was to evaluate the competitive performance of open-weight LLMs against
state-of-the-art proprietary LLMs for biomedical question answering. To this end, we rigorously tested
numerous configurations, including both smaller open-weight models and various closed models. Our
results consistently demonstrated the competitiveness of ensembles of open models for BioASQ 13B
Phase B questions.</p>
      <p>For yes/no questions, this thesis is supported by results from batches 2 and 3 (Table 3), where
open-weight models outperformed closed-weight models. This trend also holds for list-based questions, with
batches 2 through 4 demonstrating strong performance by open models on both factoid and list-type
questions (Tables 4 and 5, respectively). For summary questions, the open-weight model Phi-4 exhibited
promising performance in terms of ROUGE metrics in Batches 1 through 3, as detailed in Table 6.</p>
      <p>This conclusion holds significant implications. The ability to use open-weight models negates the need
for proprietary solutions in every application. This is particularly relevant for applications involving
highly restricted data that require on-premise deployment, a common scenario with medical data. In
such contexts, smaller self-deployable models offer a compelling and practical alternative.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research was co-funded by the European Union – European Regional Development Fund
(Programme: European Funds for a Modern Economy 2021-2027, grant no. FENG.01.01-IP.02-4479/23).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Gemini for grammar and spelling checks and for paraphrasing and rewording. After using this tool/service, the author(s) reviewed and edited the
content as needed and take(s) full responsibility for the publication’s content.</p>
      <p>[30] Anthropic, Claude 3.7 Sonnet system card, 2025. URL: https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf.</p>
      <p>[31] D. Jiang, X. Ren, B. Y. Lin, LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion, 2023. URL: https://arxiv.org/abs/2306.02561. arXiv:2306.02561.</p>
      <p>[32] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74-81. URL: https://aclanthology.org/W04-1013/.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <source>Overview of BioASQ</source>
          <year>2025</year>
          :
          <article-title>The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in: J.
          <string-name>
            <surname>Carrillo-de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Biobert: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <year>2019</year>
          )
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          . URL: http://dx.doi.org/10.1093/bioinformatics/btz682. doi:
          <volume>10</volume>
          .1093/bioinformatics/btz682.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          , H. Cheng, M. Lucas,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <article-title>Domain-specific language model pretraining for biomedical natural language processing</article-title>
          ,
          <year>2020</year>
          . arXiv:arXiv:
          <year>2007</year>
          .15779.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>C.</given-names> <surname>Raffel</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Shazeer</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Roberts</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Narang</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Matena</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>P. J.</given-names> <surname>Liu</surname></string-name>,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>,
          <year>2023</year>. URL: https://arxiv.org/abs/1910.10683. arXiv:1910.10683.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>T. B.</given-names> <surname>Brown</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Mann</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Ryder</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Subbiah</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kaplan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Dhariwal</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Neelakantan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Shyam</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Sastry</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Askell</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Agarwal</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Herbert-Voss</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Krueger</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Henighan</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Child</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Ramesh</surname></string-name>,
          <string-name><given-names>D. M.</given-names> <surname>Ziegler</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Winter</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Hesse</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Sigler</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Litwin</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Gray</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Chess</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Clark</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Berner</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>McCandlish</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Radford</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Amodei</surname></string-name>,
          <article-title>Language models are few-shot learners</article-title>,
          <year>2020</year>. URL: https://arxiv.org/abs/2005.14165. arXiv:2005.14165.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] OpenAI, ChatGPT (Nov 2022 version),
          <year>2022</year>
          . URL: https://chat.openai.com.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><surname>Gemini Team</surname></string-name>
          ,
          <article-title>Gemini: A family of highly capable multimodal models</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2312.11805. arXiv:2312.11805.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Anthropic</surname>
          </string-name>
          , Introducing claude,
          <year>2023</year>
          . URL: https://www.anthropic.com/index/introducing-claude, accessed: 2025-05-28.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><surname>Llama Team, AI @ Meta</surname></string-name>
          ,
          <source>The llama 3 herd of models</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>DeepSeek-AI</surname>
          </string-name>
          ,
          <source>DeepSeek-V3 technical report</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2412.19437. arXiv:2412.19437.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><surname>Qwen Team</surname></string-name>
          ,
          <source>Qwen3 technical report</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2505.09388. arXiv:2505.09388.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>W.-L.</given-names> <surname>Chiang</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zheng</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Sheng</surname></string-name>,
          <string-name><given-names>A. N.</given-names> <surname>Angelopoulos</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Zhu</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Jordan</surname></string-name>,
          <string-name><given-names>J. E.</given-names> <surname>Gonzalez</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Stoica</surname></string-name>,
          <article-title>Chatbot arena: An open platform for evaluating llms by human preference</article-title>,
          <year>2024</year>. URL: https://arxiv.org/abs/2403.04132. arXiv:2403.04132.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Paliouras</surname></string-name>
          ,
          <article-title>Overview of BioASQ Tasks 13b and Synergy13 in CLEF2025</article-title>
          , in:
          <string-name><given-names>G.</given-names> <surname>Faggioli</surname></string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name><given-names>D.</given-names> <surname>Spina</surname></string-name>
          (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name><given-names>I.</given-names> <surname>Gurevych</surname></string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          , in:
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Nussbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Morris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Duderstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mulyar</surname>
          </string-name>
          ,
          <article-title>Nomic embed: Training a reproducible long context text embedder</article-title>
          ,
          <year>2024</year>
          . arXiv:2402.01613.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>S.</given-names> <surname>Min</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Lyu</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Holtzman</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Artetxe</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Lewis</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hajishirzi</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Zettlemoyer</surname></string-name>,
          <article-title>Rethinking the role of demonstrations: What makes in-context learning work?</article-title>
          , in:
          <string-name><given-names>Y.</given-names> <surname>Goldberg</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Kozareva</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>
          (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>11048</fpage>
          -
          <lpage>11064</lpage>
          . URL: https://aclanthology.org/2022.emnlp-main.759/. doi:10.18653/v1/2022.emnlp-main.759.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><surname>Qdrant Team</surname></string-name>
          , Qdrant documentation,
          <year>2021</year>
          . URL: https://qdrant.tech/documentation/.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>O.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Herzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berant</surname>
          </string-name>
          ,
          <article-title>Learning to retrieve prompts for in-context learning</article-title>
          ,
          <source>arXiv abs/2112.08633</source>
          (
          <year>2021</year>
          ). URL: https://api.semanticscholar.org/CorpusID:245218561.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singhvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maheshwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Santhanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vardhamanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Haq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name><given-names>T. T.</given-names> <surname>Joshi</surname></string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Moazam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          ,
          <article-title>DSPy: Compiling declarative language model calls into self-improving pipelines</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B. T.</given-names>
            <surname>Willard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <article-title>Efficient guided generation for llms</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09702</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aneja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Behl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gunasekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Harrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Hewett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Javaheripi</surname>
          </string-name>
          ,
          <string-name><given-names>P.</given-names> <surname>Kaufmann</surname></string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C. T.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Price</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>de Rosa</surname></string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Saarikivi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>
          ,
          <source>Phi-4 technical report</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2412.08905. arXiv:2412.08905.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name><surname>Gemma Team</surname></string-name>
          , Gemma 3 (
          <year>2025</year>
          ). URL: https://goo.gle/Gemma3Report.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name><surname>Qwen Team</surname></string-name>
          ,
          <source>Qwen2.5 technical report</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2412.15115. arXiv:2412.15115.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name><surname>OpenMeditron Team</surname></string-name>
          , Model card:
          <source>Meditron3-Phi4-14B</source>
          ,
          <year>2025</year>
          . URL: https://huggingface.co/OpenMeditron/Meditron3-Phi4-14B.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name><surname>Mistral AI Team</surname></string-name>
          ,
          <source>Mistral small 3.1</source>
          ,
          <year>2025</year>
          . URL: https://mistral.ai/news/mistral-small-3-1.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name><given-names>W.</given-names> <surname>Liu</surname></string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Huatuogpt-o1, towards medical complex reasoning with llms</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2412.18925. arXiv:2412.18925.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <source>GPT-4o system card</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2410.21276. arXiv:2410.21276.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <source>Introducing GPT-4.1 in the API</source>
          ,
          <year>2025</year>
          . URL: https://openai.com/index/gpt-4-1/.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name><surname>Anthropic</surname></string-name>
          ,
          <source>Claude 3.5 Sonnet</source>
          ,
          <year>2024</year>
          . URL: https://www.anthropic.com/news/claude-3-5-sonnet.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>