<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Applying DeepSeek to BioASQ Task 13B: Using Supervised Fine-Tuning and Few-Shot Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jie Tang</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hua Yang</string-name>
          <email>huayang@zut.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kai Xiong</string-name>
          <email>xiongkai2024@163.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hanyang Li</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paulo Quaresma</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongbin Yu</string-name>
          <email>hongbinyu@zut.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenbo Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mingzhou Song</string-name>
          <email>mingzhou@zut.edu.cn</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aquinas International Academy</institution>
          ,
          <addr-line>3200 E Guasti Rd, Suite 100, Ontario, 91761, CA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, VISTA Lab, Algoritmi Center, University of Évora</institution>
          ,
          <addr-line>7000-671, Évora</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Artificial Intelligence, Zhongyuan University of Technology</institution>
          ,
          <addr-line>Zhengzhou, 450007, Henan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>School of Computer Science, Zhongyuan University of Technology</institution>
          ,
          <addr-line>Zhengzhou, 450007, Henan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The recent surge in popularity of DeepSeek has attracted significant attention, yet its practical performance in real-world applications remains largely unexplored. In this study, our team participated in BioASQ Task 13b, which focuses on biomedical information retrieval and question answering (QA). We evaluated the DeepSeek model using three different approaches: local deployment, API-based access, and supervised fine-tuning. Specifically, we investigated the model's performance in few-shot learning settings. Notably, in Phase A+, our system using the deepseek-r1:671b model combined with retrieval-augmented generation techniques ranked first among all 67 submitted runs on yes/no questions in Batch 4. In Phase B, systems using both the deepseek-r1:32b and deepseek-r1:671b models achieved top performance on yes/no questions in Batches 2 and 3. Additionally, the system using the deepseek-r1:32b model ranked first on list questions in Batch 1. Our results demonstrate that the proposed method is effective in biomedical QA tasks and shows promising potential for future applications in the domain. The code is available at https://github.com/wuren519/bioasq-2025.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Model</kwd>
        <kwd>Few-Shot Learning</kwd>
        <kwd>Supervised Fine-tuning</kwd>
        <kwd>Biomedical Information Retrieval</kwd>
        <kwd>Biomedical Question Answering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Biomedical QA</title>
        <p>
          Biomedical QA systems aim to provide concise answers to specialized biomedical queries. The
BioASQ task includes expert-written English questions of four types: yes/no, factoid, list, and
summary[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In BioASQ Task 13b, systems retrieve relevant PubMed articles and snippets and produce
an “exact” answer and an “ideal” answer for each question. For yes/no questions, the exact answer is
“yes” or “no”; for factoid and list questions, it is the named entity or list of entities that answer the
question; for summary questions, no exact answer is given and only the ideal answer[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Importantly,
BioASQ provides all synonyms of the gold answers – e.g. a gene or disease may have multiple names –
so systems must handle these alternative forms. The ideal answers are paragraph-sized explanations,
intended to augment the exact answer. Therefore, the evaluation of the systems consists of two aspects:
first, the assessment of exact answers, including accuracy or F1 score for yes/no questions, MRR for
factoid questions, and F-Measure for list questions; second, the evaluation of summary quality, which is
typically conducted through manual assessment or using ROUGE metrics.
        </p>
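        <p>To make these exact-answer metrics concrete, the following Python sketch computes MRR for factoid questions and an entity-level F1 for list questions. It is a simplified illustration that uses exact lowercase string matching only; the official BioASQ evaluation scripts additionally normalize answers and match against all provided synonyms.</p>
        <preformat>
# Simplified sketch of the exact-answer metrics; illustrative only.

def mean_reciprocal_rank(ranked_candidates, gold_answers):
    """ranked_candidates: one ranked list of answer strings per question.
    gold_answers: one set of accepted answer strings per question."""
    total = 0.0
    for ranked, gold in zip(ranked_candidates, gold_answers):
        gold_lower = {g.lower() for g in gold}
        for rank, candidate in enumerate(ranked, start=1):
            if candidate.lower() in gold_lower:
                total += 1.0 / rank
                break
    return total / len(gold_answers)

def list_f_measure(predicted, gold):
    """F1 for a single list question (entity-level, exact match)."""
    pred = {p.lower() for p in predicted}
    gold = {g.lower() for g in gold}
    true_positives = len(pred.intersection(gold))
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
        </preformat>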
      </sec>
      <sec id="sec-2-2">
        <title>2.2. LLMs for biomedical/health QA</title>
        <p>The emergence of LLMs such as GPT, PaLM, and LLaMA has significantly advanced the field of
QA, including biomedical and health domains[8]. These models, trained on vast corpora of general and
domain-specific text, exhibit strong capabilities in understanding natural language queries, retrieving
relevant information, and generating fluent, contextually appropriate answers. In the biomedical
domain, this is particularly valuable due to the dense and specialized nature of medical texts[9].</p>
        <p>
          Recent studies have demonstrated the utility of LLMs for biomedical QA tasks[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Models such as
BioGPT[10] and PubMedGPT, pre-trained on biomedical literature, have shown improved performance
on tasks like document classification, named entity recognition, and QA. Other works have explored
prompt-based approaches, leveraging instruction-tuned LLMs to answer biomedical questions without
task-specific SFT[11]. Despite promising results, challenges remain, including handling domain-specific
terminology, ensuring factual correctness, and retrieving up-to-date evidence from sources like PubMed.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. LLMs Supervised Fine-tuning</title>
        <p>SFT has become a widely used strategy to adapt general-purpose LLMs to specific domains and
tasks[12]. By training on labeled examples, SFT enables LLMs to align better with target outputs, adhere
to domain-specific answer styles, and improve factuality. In biomedical QA, where precise terminology
and structured responses are often required, SFT is particularly valuable.</p>
        <p>Recent work has shown that models such as BioMedLM[13], BioGPT, and domain-adapted versions
of T5[14] or BERT[15] benefit significantly from SFT on biomedical corpora or QA datasets. These
models outperform zero-shot LLMs in tasks requiring structured output, such as factoid or list-type
answers in BioASQ.</p>
        <p>Given the promising results achieved by prior work using LLMs in biomedical QA tasks, and the
strong performance of the DeepSeek models across various benchmarks, we explore the application of
DeepSeek to biomedical QA and further fine-tune it for domain-specific adaptation.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Phase A</title>
      <p>Inspired by the work of Samy Ateia and Udo Kruschwitz[16], in Phase A, we built a query
expansion-driven multi-stage retrieval and reranking framework based on PubMed. The framework consists of
four components: query expansion, PubMed retrieval, document filtering, and snippet reranking. The
detailed process is illustrated in Figure 1.</p>
      <p>We first perform few-shot query expansion on the original question. Using the DeepSeek model
and a set of predefined examples, we generate a structured Boolean query expression. These predefined
examples were selected by evaluating our own results on historical BioASQ datasets using F1 scores
and choosing the highest-scoring entries, with the aim of enhancing the model’s ability to handle
current inputs by providing high-quality references. The output is then post-processed using regular
expressions to remove redundant tags and adapt the format for the PubMed API.</p>
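      <p>A minimal sketch of this step is shown below. The prompt wording, the example entries, and the call_deepseek() helper are illustrative placeholders rather than the exact artifacts used in our submitted runs.</p>
      <preformat>
import re

# Illustrative few-shot query expansion; examples and prompt wording are placeholders.
FEW_SHOT_EXAMPLES = [
    {"question": "Is dupilumab effective for atopic dermatitis?",
     "query": '"dupilumab"[Title/Abstract] AND "atopic dermatitis"[Title/Abstract]'},
    # ... up to 10 high-scoring examples selected from historical BioASQ data
]

def build_expansion_prompt(question):
    shots = "\n\n".join(
        "Question: {}\nBoolean query: {}".format(ex["question"], ex["query"])
        for ex in FEW_SHOT_EXAMPLES
    )
    return (shots + "\n\nQuestion: " + question +
            "\nBoolean query (PubMed syntax, single line, no explanation):")

def postprocess_query(raw_output):
    # Drop code fences and blank lines, keep the last non-empty line,
    # and collapse whitespace so the string is safe for the PubMed API.
    lines = [ln.strip().strip("`") for ln in raw_output.splitlines() if ln.strip()]
    query = lines[-1] if lines else ""
    return re.sub(r"\s+", " ", query)

# expanded = postprocess_query(call_deepseek(build_expansion_prompt(question)))
      </preformat>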
      <p>Then, we use the PubMed API to retrieve articles based on the expanded query. The retrieval
is limited to articles published between 2000 and 2025, in accordance with the BioASQ Task 13b
requirement to use the 2025 PubMed baseline version. To ensure coverage of contemporary biomedical
research, the year 2000 is selected as a reasonable starting point. If the initial query returns no results, a
query refinement module is triggered. In this step, the original keywords that failed to retrieve articles
are also included in the prompt, allowing LLMs to generate a broader query that retains the original
context and relevance.</p>
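      <p>The retrieval call can be sketched as follows using the NCBI E-utilities esearch endpoint. The parameter values (such as retmax) and the expand_fn/refine_fn helpers are assumptions introduced for illustration rather than our exact configuration.</p>
      <preformat>
import requests

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def search_pubmed(boolean_query, retmax=50):
    # Publication-date window follows the 2000-2025 restriction described above.
    params = {
        "db": "pubmed",
        "term": boolean_query,
        "retmode": "json",
        "retmax": retmax,
        "datetype": "pdat",
        "mindate": "2000",
        "maxdate": "2025",
    }
    response = requests.get(ESEARCH_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["esearchresult"].get("idlist", [])

def retrieve_documents(question, expand_fn, refine_fn):
    query = expand_fn(question)
    pmids = search_pubmed(query)
    if not pmids:
        # Empty result: ask the model for a broader query that keeps the
        # failed keywords in the prompt, as described above.
        query = refine_fn(question, failed_query=query)
        pmids = search_pubmed(query)
    return query, pmids
      </preformat>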
      <p>Next, we retrieve titles and abstracts for each PMID returned. We then use a large language
model-based snippet extraction module to identify semantically relevant passages from the returned
articles. If no relevant snippets are found, the system will re-trigger the query refinement and retrieval
process, with a maximum of two iterations—a limit determined empirically: we tested 1, 2, and 3
iterations and found that two iterations yielded the best overall performance.</p>
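      <p>The overall loop with at most two refinement iterations can be sketched as below, reusing search_pubmed() from the previous sketch. Here fetch_titles_abstracts() and extract_snippets() stand in for an E-utilities efetch wrapper and the LLM-based snippet extraction module, and are assumed helpers.</p>
      <preformat>
def retrieval_pipeline(question, expand_fn, refine_fn,
                       fetch_titles_abstracts, extract_snippets, max_refinements=2):
    # Placeholder helpers: fetch_titles_abstracts() wraps the efetch endpoint;
    # extract_snippets() is the LLM-based snippet extraction module.
    query = expand_fn(question)
    for attempt in range(max_refinements + 1):
        pmids = search_pubmed(query)
        articles = fetch_titles_abstracts(pmids)
        snippets = extract_snippets(question, articles)
        if snippets or attempt == max_refinements:
            return articles, snippets
        query = refine_fn(question, failed_query=query)
    return [], []
      </preformat>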
      <p>Finally, we rank the extracted snippets and reorder the original list of articles based on snippet
relevance. This ensures that the most relevant snippets appear at the top of the final system output,
enhancing answer usability and accuracy. The system ultimately selects the 10 most helpful snippets
from the retrieved results.</p>
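      <p>A sketch of this final reranking step is given below; representing the scored snippets as (pmid, text, score) triples is an assumption about the reranker output rather than our exact data structure.</p>
      <preformat>
def finalize_ranking(scored_snippets, retrieved_pmids, top_k=10):
    # scored_snippets: (pmid, snippet_text, relevance_score) triples from the reranker.
    top_snippets = sorted(scored_snippets, key=lambda s: s[2], reverse=True)[:top_k]
    # Promote documents that contributed a top snippet, keeping the rest behind them.
    promoted = []
    for pmid, _, _ in top_snippets:
        if pmid not in promoted:
            promoted.append(pmid)
    remainder = [p for p in retrieved_pmids if p not in promoted]
    return promoted + remainder, top_snippets
      </preformat>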
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Phase A+ and Phase B</title>
        <p>The methods used in Phase A+ and Phase B are identical; the only difference lies in the source of
the input documents and snippets. In Phase A+, the relevant articles and snippets are retrieved by our
own system during Phase A, whereas in Phase B, they are officially provided and selected by experts.
The detailed process is illustrated in Figure 2.</p>
        <p>In both Phase A+ and Phase B, we adopt a question-type-specific prompting strategy to handle the
four types of questions: yes/no, factoid, list, and summary. Table 1 illustrates the prompt specifically
designed for list-type questions in Phase B, which resulted in the best performance of our model on the
corresponding task. For each type, a tailored prompt is constructed, incorporating the relevant text
snippets to help LLMs better answer the question. For yes/no, factoid, and list questions, the model is
required to generate both the "exact_answer" and the "ideal_answer". For summary questions, only the
"ideal_answer" is needed.</p>
        <p>Table 1 is given in Chinese together with an English translation; its English version reads as follows:</p>
        <preformat>
[Snippets]: {Snippets}
[Question]: {question}
[Instructions]:
1. Return a JSON strictly in this format: {"entities": ["entity1", "entity2"]}
2. Entities must be extracted from the snippets
3. Sort entities by frequency of occurrence (high to low)
4. Each entity must be a noun phrase (2-5 words)
5. Return up to 100 unique entities
6. Do not include any explanatory text
[Example]:
Question: What are the common symptoms of COVID-19?
Response: {"entities": ["fever", "cough", "fatigue", "loss of smell"]}
        </preformat>
        <p>Moreover, in our system, we apply SFT to a general-purpose open-source LLM using historical
BioASQ QA data. The model was trained to generate answers for all question types in Phase A+ and
Phase B of BioASQ. In Phase A+, compared to the prompt-only version of our system, the SFT model
shows improved consistency and factual accuracy, particularly for factoid and list questions.</p>
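        <p>A minimal sketch of this fine-tuning step is given below, assuming the Hugging Face TRL library, the DeepSeek-R1-Distill-Llama-8B checkpoint as the open-weight counterpart of deepseek-r1:8b, and illustrative hyperparameters; our actual training configuration may differ.</p>
        <preformat>
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# historical_bioasq_examples is a placeholder: (question, snippets, gold_answer)
# triples assembled from previous BioASQ training data.
train_dataset = Dataset.from_list([
    {"text": "Question: {}\nSnippets: {}\nAnswer: {}".format(q, s, a)}
    for q, s, a in historical_bioasq_examples
])

trainer = SFTTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",  # assumed checkpoint name
    train_dataset=train_dataset,
    args=SFTConfig(
        output_dir="deepseek-r1-8b-bioasq-sft",
        num_train_epochs=3,                 # illustrative hyperparameters
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
    ),
)
trainer.train()
        </preformat>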
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <sec id="sec-4-1">
        <title>4.1. Task 13B Phase A</title>
      <p>We participated in batches 1, 2, and 4 of Task 13b Phase A. Based on the work of Samy Ateia and
Udo Kruschwitz [16], we limited the number of few-shot examples to 10 in all our systems. We used
two models: deepseek-r1:32b, a 32-billion-parameter reasoning model released by DeepSeek and
deployed locally, and deepseek-r1:671b, which was accessed via API.</p>
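      <p>The two access paths can be sketched as follows; the endpoints follow the public Ollama REST API and the OpenAI-compatible DeepSeek API, and the model identifiers and timeouts shown are assumptions used for illustration rather than our exact deployment settings.</p>
      <preformat>
import requests

def generate_local(prompt, model="deepseek-r1:32b"):
    # Locally deployed model served by Ollama on the default port.
    resp = requests.post("http://localhost:11434/api/generate",
                         json={"model": model, "prompt": prompt, "stream": False},
                         timeout=600)
    resp.raise_for_status()
    return resp.json()["response"]

def generate_api(prompt, api_key, model="deepseek-reasoner"):
    # Remote access through the OpenAI-compatible DeepSeek API.
    resp = requests.post("https://api.deepseek.com/chat/completions",
                         headers={"Authorization": "Bearer " + api_key},
                         json={"model": model,
                               "messages": [{"role": "user", "content": prompt}]},
                         timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
      </preformat>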
      <p>Table 2 summarizes all the runs we submitted to the BioASQ 13b task, covering Phase A, A+, and
B. Each row corresponds to a specific run configuration. The System Name column indicates the group
of runs we designed, such as ZUT-IR-1 to ZUT-IR-3. Submit Name and Run denote the submission
identifier and the order of the run within the system. Model refers to the large language model used in
the run, while Original Name indicates the original name of the system. The last three columns (Phase
A, Phase A+, and Phase B) specify whether the run was submitted to the respective phases. For example,
ZUT-IR-2-b represents the second run under system ZUT-IR-2, which uses the DeepSeek-r1:8b-distill
model. This run was submitted to both Phase A+ and Phase B, and its original name is deepseek32b-full.</p>
      <p>We compared the performance of deepseek-r1:671b and deepseek-r1:32b on Task 13b Phase
A. In both batch 1 and batch 2, the system using deepseek-r1:671b outperformed the one using
deepseek-r1:32b in terms of MAP evaluation, in both document retrieval and snippet extraction. In
batch 4, the system using deepseek-r1:32b significantly outperformed the one using deepseek-r1:671b
in terms of MAP evaluation, both in document retrieval and snippet extraction. Among them, for
document retrieval, the MAP score of the deepseek-r1:32b-based system was 0.1014, whereas the
score of the deepseek-r1:671b-based system was only 0.0586. Similar observations have been reported
in previous studies. Srivastava et al. [17] found in the BIG-bench benchmark that different models
exhibit significant variation in performance across tasks, which may be due to the alignment between
task characteristics and model capabilities. Furthermore, Wei et al. [18] pointed out that in certain
reasoning tasks, smaller models may outperform larger ones due to their preference for specific
patterns. Therefore, our observation that deepseek-r1:32b outperforms larger models in Batch 4 may be
attributed to a distribution of questions that better aligns with its strengths.</p>
      <p>Notes on the system names in Table 2: only in Phase B, the Original Names of ZUT-IR-1-a and
ZUT-IR-1-b are deepseek32b-f and phaseB-4, respectively; in all other phases, their Original Names are
deepseek32b-me and deepseek32b-full. Similarly, only in Phase A, the Original Name of ZUT-IR-3-a is
deepseek32b-full, while in other phases it corresponds to phaseB-4. In the fourth batch of Phase A+, the
Original Name of ZUT-IR-1-a is deepseek32b-f, while in the other batches it is deepseek32b-me. In the
result tables, "Top Competitor" refers to the highest-ranked system in that batch that is not ours; when
the top competitor is absent in a reported batch, one of our systems is the best-performing one.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task 13B Phase A+</title>
      <p>In Phase A+, we also used 10-shot learning across all systems. The specific system configurations
are shown in Table 2. The majority of our systems leveraged the deepseek-r1:32b model, with the
deepseek-r1:671b model used selectively in later-phase systems. In batches 3 and 4, we further conducted
SFT on the deepseek-r1:8b model as part of our experiments.</p>
      <p>Table 5 presents the yes/no results for exact questions in Phase A+. In batch 1, for the exact
yes/no questions, we submitted two systems, both based on the deepseek-r1:32b model. In batch 2, we
submitted five systems. The results show that both systems using deepseek-r1:671b outperformed the
three systems based on deepseek-r1:32b. Among them, the best-performing system was the second
run based on the deepseek-r1:671b model, which ranked 10th out of all 49 submitted systems. In batch
3, we submitted two systems, both using the SFT deepseek-r1:8b model. The results indicated that
both systems performed poorly. In batch 4, we submitted five systems. Among them, the “-a” systems
utilized relevant snippets retrieved in Phase A by the deepseek-r1:32b model, while the “-b” systems
used snippets obtained in Phase A by the deepseek-r1:671b model. Among the two systems using the
deepseek-r1:671b model in Phase A+, ZUT-IR-3-a—whose snippets were retrieved by deepseek-r1:32b in
Phase A—ranked 1st out of all 67 submitted systems; in contrast, ZUT-IR-3-b—whose snippets came from
deepseek-r1:671b—ranked only 26th. A similar trend was observed between ZUT-IR-2-a and ZUT-IR-2-b,
where the only difference was the snippet retrieval model used. These observations underscore the
critical impact of snippet retrieval quality on answer generation performance in Phase A+. It is worth
noting that the systems based on the SFT deepseek-r1:8b model performed very poorly in batch 3, and
were also outperformed in batch 4 by systems using both deepseek-r1:671b and deepseek-r1:32b.</p>
      <p>Table 6 presents the factoid results for exact questions in Phase A+. For the exact factoid questions,
the system configurations across batches were consistent with those described for the exact yes/no
questions. In batch 1, the two systems we submitted based on the deepseek-r1:32b model performed
poorly. In batch 2, we submitted five systems. Consistent with the results for the exact yes/no questions,
the two systems using deepseek-r1:671b outperformed the three systems based on deepseek-r1:32b. In
batch 3, the two systems (ZUT-IR-2-a and ZUT-IR-2-b) based on the supervised fine-tuned
deepseek-r1:8b model achieved competitive results, ranking 16th and 17th among the 58 submitted systems.
In batch 4, results showed that the two systems using the SFT deepseek-r1:8b model achieved the
best performance, followed by the systems based on deepseek-r1:32b, while the two systems using
deepseek-r1:671b performed the worst. Among the two systems based on the SFT deepseek-r1:8b model,
the one that utilized snippets retrieved by deepseek-r1:32b in Phase A outperformed the one that used
snippets retrieved by deepseek-r1:671b, further confirming the critical impact of snippet retrieval quality
on answer generation performance in Phase A+.</p>
      <p>Notably, in Phase A+, our SFT deepseek-r1:8b model performed well on both the exact factoid
and exact list questions, outperforming the locally deployed deepseek-r1:32b model and the API-based
deepseek-r1:671b model. However, its performance on the exact yes/no questions was suboptimal. This
performance pattern suggests that SFT may be particularly effective in guiding the model to produce
well-structured answers for factoid and list-type questions, which often follow predictable formats.
In contrast, yes/no questions typically require more nuanced semantic understanding and inferential
reasoning, which may benefit from the broader knowledge capacity and emergent reasoning abilities of
larger, non-fine-tuned models such as deepseek-r1:671b.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Task 13B Phase B</title>
        <p>In Phase B, we also used 10-shot learning across all systems. The specific system configurations
are shown in Table 2. We submitted five systems in each of the four batches. We used three types of
models: a locally deployed deepseek-r1:32b model, an API-accessed deepseek-r1:671b model, and an SFT
deepseek-r1:8b model.</p>
        <p>Table 8 presents the yes/no results for exact questions in Phase B. In batch 1, for the exact yes/no
questions, all five systems using the deepseek-r1:32b model performed poorly. In batch 2, the results
showed that the two systems using the deepseek-r1:671b model outperformed the three systems based
on deepseek-r1:32b. Moreover, these two deepseek-r1:671b systems tied for first place among all 72
submitted systems, achieving an accuracy of 1. In batch 3, our submitted system ZUT-IR-1-b and system
ZUT-IR-3-a tied for first place among all 66 submitted systems. In batch 4, the best-performing system
was ZUT-IR-3-b, which ranked 9th out of 79 submitted systems. The other four systems achieved exactly
the same accuracy. It is worth noting that the deepseek-r1:671b model consistently outperformed its
smaller counterparts in handling yes/no questions across multiple batches in Phase B. This observation
suggests that larger models may possess stronger generalization and inferential reasoning capabilities,
which are particularly beneficial for binary classification tasks. In contrast, the SFT deepseek-r1:8b
model exhibited relatively poor and inconsistent performance, possibly due to limited coverage or
distributional bias in the fine-tuning data, which may hinder its ability to handle semantically diverse
or ambiguous yes/no questions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>We demonstrated the advanced performance of the emerging DeepSeek models in biomedical
retrieval scenarios when combined with 10-shot learning. In BioASQ Task 13b, we utilized three different
configurations of DeepSeek models: the locally deployed deepseek-r1:32b, the SFT deepseek-r1:8b, and
the API-based deepseek-r1:671b. Notably, in Phase A+, our system using the deepseek-r1:671b model
combined with retrieval-augmented generation techniques ranked first among all 67 submitted runs on
yes/no questions in Batch 4. In Phase B, systems using both the deepseek-r1:32b and deepseek-r1:671b
models achieved top performance on yes/no questions in Batches 2 and 3. Additionally, the system
using the deepseek-r1:32b model ranked first on list questions in Batch 1. Our SFT deepseek-r1:8b
model performed well only on the exact factoid and exact list questions in Phase A+, but showed poor
performance in other phases.</p>
      <p>Our findings suggest that SFT may be particularly effective in guiding the model to produce
well-structured answers for factoid and list-type questions, which often follow predictable formats.
In contrast, yes/no questions typically require more nuanced semantic understanding and inferential
reasoning, which may benefit more from the broader knowledge capacity and emergent reasoning
abilities of larger, non-fine-tuned models such as deepseek-r1:671b.</p>
      <p>In the future, we will conduct in-depth research on the application of fine-tuned models in Task
13b, aiming to further explore their capabilities and limitations across different question types and
retrieval settings.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research work was funded by: 1) the Key Scientific Research Project of Higher Education
Institutions in Henan Province, grant no. 24A520060; and 2) the Graduate Education and Teaching Reform
Research Project of Zhongyuan University of Technology, grant no. JG202434.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this manuscript, the authors used ChatGPT for text translation. After
utilizing this tool, the authors carefully reviewed, revised, and edited the translated content, and assume
full responsibility for the accuracy and integrity of the final version.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.-C.</given-names>
            <surname>Chih</surname>
          </string-name>
          , J.-C. Han,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tzong-Han</surname>
          </string-name>
          <string-name>
            <surname>Tsai</surname>
          </string-name>
          ,
          <article-title>Ncu-iisr: enhancing biomedical question answering with gpt-4 and retrieval augmented generation in bioasq 12b phase b</article-title>
          ,
          <source>CLEF Working Notes</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. M. D. Nunzio</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Marchesin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras, Overview of bioasq
          <year>2025</year>
          :
          <article-title>The thirteenth bioasq challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Ngo</surname>
          </string-name>
          ,
          <article-title>Using pretrained large language model with prompt engineering to answer biomedical questions</article-title>
          ,
          <source>arXiv preprint arXiv:2407.06779</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Paliouras, Overview of BioASQ Tasks 13b and Synergy13 in CLEF2025</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bougiatiotis</surname>
          </string-name>
          , G. Paliouras,
          <article-title>Bioasq-qa: A manually curated corpus for biomedical question answering</article-title>
          ,
          <source>Scientific Data</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>170</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Biomedical question answering: a survey of approaches and challenges</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 55</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. URL: https://arxiv.org/abs/2302.13971. arXiv:2302.13971.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Q. Chen, Y. Hu, X. Peng, Q. Xie, Q. Jin, A. Gilson, M. B. Singer, X. Ai, P.-T. Lai, Z. Wang, et al., Benchmarking large language models for biomedical natural language processing applications and recommendations, Nature Communications 16 (2025) 3280.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, T.-Y. Liu, BioGPT: generative pre-trained transformer for biomedical text generation and mining, Briefings in Bioinformatics 23 (2022) bbac409.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned language models are zero-shot learners, arXiv preprint arXiv:2109.01652 (2021).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, Advances in Neural Information Processing Systems 35 (2022) 27730–27744.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] E. Bolton, A. Venigalla, M. Yasunaga, D. Hall, B. Xiong, T. Lee, R. Daneshjou, J. Frankle, P. Liang, M. Carbin, et al., BioMedLM: A 2.7B parameter language model trained on biomedical text, arXiv preprint arXiv:2403.18421 (2024).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] S. Ateia, U. Kruschwitz, Can open-source LLMs compete with commercial models? Exploring the few-shot performance of current GPT models in biomedical tasks, arXiv preprint arXiv:2407.13511 (2024).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al., Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, arXiv preprint arXiv:2206.04615 (2022).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, et al., Emergent abilities of large language models, arXiv preprint arXiv:2206.07682 (2022).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>