<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>pjmathematician at MultiClinSUM 2025: A Novel Automated Prompt Optimization Framework for Multilingual Clinical Summarization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Poojan Vachharajani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Netaji Subhas University of Technology</institution>
          ,
          <addr-line>New Delhi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper describes the 'pjmathematician' team's submission to the MultiClinSUM 2025 shared task, focusing on multilingual summarization of clinical case reports in English, Spanish, French, and Portuguese. Our approach leverages fine-tuned Large Language Models (LLMs) from the Qwen family, adapted using Low-Rank Adaptation (LoRA). The core of our methodology is a novel, automated prompt optimization framework where a "judge" LLM iteratively refines the system prompt for a "worker" LLM to maximize summarization quality, measured by ROUGE scores. This process resulted in a highly-specific, extraction-focused prompt that instructs the model to mirror the source text's terminology and structure with high fidelity. We submitted multiple runs using different model configurations, trained exclusively on the provided gold-standard dataset. Our results demonstrate the effectiveness of this automated prompt engineering strategy, achieving competitive scores across all four languages, with BERTScore F1 reaching up to 0.864 in English.</p>
      </abstract>
      <kwd-group>
        <kwd>Clinical Summarization</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>Automated Prompt Optimization</kwd>
        <kwd>LoRA</kwd>
        <kwd>MultiClinSUM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The automated summarization of clinical text is a long-standing challenge in natural language processing,
driven by the need to condense vast amounts of clinical data from sources like electronic health records
and medical literature to support clinical decision-making [15]. Early approaches were often extractive,
relying on methods like TextRank. However, the advent of deep learning and transformer-based
architectures has led to significant progress, with models like BERT and T5 being adapted for the clinical
domain [5].</p>
      <p>Recent research has heavily focused on the application of Large Language Models (LLMs) to this
problem, demonstrating their potential to generate high-quality, coherent summaries [24, 18]. Studies
have shown that with appropriate adaptation, such as fine-tuning, LLMs can produce summaries of
clinical texts that are comparable or even superior to those written by medical experts [21]. This has
been explored across a variety of clinical documents, including radiology reports, progress notes, and
doctor-patient dialogues [12]. A significant portion of this research has been conducted on
English-language data, often using datasets like MIMIC-IV [8]. While multilingual summarization is a recognized
goal, as evidenced by the MultiClinSUM shared task itself [17], dedicated studies in this area remain
less common.</p>
      <p>A critical aspect of leveraging LLMs is prompt engineering, which has been shown to significantly
influence model performance [27, 23]. The process of designing effective prompts is crucial in specialized
domains like medicine, which has its own unique terminology and structure. Our work aligns with a
growing body of research that seeks to move beyond manual prompt crafting towards more systematic
and automated methods [7]. This includes techniques where an LLM itself is used to refine prompts.
For instance, Pryzant et al. (2023) proposed a method using an LLM’s feedback to generate "textual
gradients" to iteratively improve a prompt [16]. Similarly, other optimization frameworks use an LLM
to generate new prompts based on the performance of previous ones [10, 26, 6]. Our "judge-worker"
framework is a novel contribution to this area of automated prompt optimization, specifically tailored
for the complexities of multilingual clinical summarization.</p>
      <p>Furthermore, our use of Low-Rank Adaptation (LoRA) for eficient fine-tuning is consistent with
current best practices for adapting large models to specialized tasks. LoRA has been successfully applied
in the clinical domain to improve performance on tasks like clinical dialogue summarization without
the prohibitive costs of full fine-tuning [ 12, 13]. Studies have shown that models fine-tuned with
LoRA on domain-specific data can achieve strong results, validating our choice of this technique. Our
approach of integrating the optimized prompt directly into the LoRA training process is a key aspect of
our methodology, ensuring the model is specifically adapted to the desired extractive and structured
summarization style.</p>
    </sec>
    <sec id="sec-3">
      <title>3. System Description</title>
      <p>Our methodology is built upon three key components: the training data, the model architecture and
training, and our automated prompt optimization framework.</p>
      <sec id="sec-3-0">
        <title>3.1. Data</title>
        <p>
          The MultiClinSUM task provides two types of training data: a "gold-standard" (GS) set and a "large-scale"
set [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. For all our experiments, we exclusively used the gold-standard datasets. These datasets consist
of 592 full-text clinical case reports and their corresponding author-written summaries for each of
the four languages (English, Spanish, French, and Portuguese). We opted for the GS data to focus our
efforts on high-quality, curated examples, believing this would be more effective for fine-tuning with
our advanced prompting strategy. No other external data sources were used.
        </p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Model Architecture and Training</title>
        <p>
          Our systems are based on models from the Qwen family [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], a series of powerful open-source LLMs.
For each base model configuration (e.g., ‘qwen3-32B‘), we performed fine-tuning using Low-Rank
Adaptation (LoRA). A key decision in our approach was to use a single, multilingual model rather than
training a separate model for each language. All 592 × 4 document-summary pairs were combined into
a single training set.
        </p>
        <p>A crucial aspect of our training strategy was the integration of our final optimized prompt (see
Appendix A) directly into the training data. For each instance, the input was formatted as a conversation
with the optimized system prompt, followed by the user prompt containing the full-text clinical case
report. The target output was the corresponding reference summary. This ensures that the LoRA
fine-tuning process adapts the model to respond optimally to the specific instructions discovered during
our optimization phase.</p>
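        <p>As a minimal sketch of this formatting step (the field names, corpus layout, and abbreviated prompt strings below are illustrative assumptions, not the exact artifacts used in our runs), each gold-standard pair can be wrapped as a system/user/assistant conversation and the four languages pooled into one training set:</p>

```python
# Sketch: wrap each gold-standard document-summary pair as a chat-style
# training instance. OPTIMIZED_SYSTEM_PROMPT abbreviates the prompt in
# Appendix A.2; the record layout is an illustrative assumption.
OPTIMIZED_SYSTEM_PROMPT = "You are a **Medical Case Encoder v3.0** ..."

def to_chat_instance(case_text: str, reference_summary: str) -> dict:
    """One training example: optimized system prompt, case report as the
    user turn, reference summary as the target assistant turn."""
    return {
        "messages": [
            {"role": "system", "content": OPTIMIZED_SYSTEM_PROMPT},
            {"role": "user", "content": "Clinical Case Report:\n" + case_text},
            {"role": "assistant", "content": reference_summary},
        ]
    }

# All four languages are pooled into a single multilingual training set
# (592 pairs per language in the real data; one toy pair each here).
corpus = {
    "en": [("case text ...", "summary ...")],
    "es": [("texto del caso ...", "resumen ...")],
    "fr": [("texte du cas ...", "résumé ...")],
    "pt": [("texto do caso ...", "resumo ...")],
}
train_set = [to_chat_instance(doc, ref)
             for pairs in corpus.values()
             for doc, ref in pairs]
```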
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Automated Prompt Optimization</title>
        <p>The cornerstone of our approach is an automated framework for discovering an optimal system prompt,
thereby reducing the manual effort and bias inherent in traditional prompt engineering. We designed
an algorithm where a "judge" LLM iteratively refines the prompt for a "worker" LLM (both LLMs were
Qwen3-32B). This process, detailed in Algorithm 1, systematically explores the vast space of possible
instructions to find a prompt that elicits the best summarization performance on a validation sample.</p>
        <p>Algorithm 1: Automated Prompt Optimization Framework
1: Input: Initial prompt P_0, sample dataset D, judge LLM, worker LLM, user prompt template, iterations N.
2: Output: Best performing prompt P*.
3: Initialize P* ← P_0.
4: Evaluate P* on D to get initial score s*.
5: for i = 1 to N do
6:   Select a transformation strategy (e.g., "Complete restructuring", "Change perspective").
7:   Generate examples E of source texts, reference summaries, and summaries produced with P*.
8:   Construct a meta-prompt for the Judge LLM, including P*, s*, the examples E, and the transformation strategy; instruct the Judge LLM to create a radically different prompt.
9:   P_i ← JudgeLLM(meta-prompt).
10:  Evaluate P_i on D to get new score s_i.
11:  if s_i &gt; s* then
12:    P* ← P_i; s* ← s_i.
13:  end if
14: end for
15: return P*.</p>
        <p>This "judge-worker" paradigm forces exploration. In each iteration, the judge LLM is instructed to
make radical, non-incremental changes to the prompt, guided by a set of "transformation strategies"
(e.g., "Complete restructuring," "Change perspective"). The judge is provided with the current prompt,
its performance score, examples of summaries it produces, and an analysis of weaknesses in the
output. Based on this, it generates a completely new set of instructions, often being explicitly told to
’RADICALLY CHANGE the prompt’ to avoid minor local-optima tweaks. This iterative refinement
continued for 40 cycles, after which the highest-scoring prompt was selected (see Appendix B). The
final prompt, detailed in Appendix A, evolved to be highly structured and prescriptive, emphasizing
verbatim extraction and strict adherence to the source text’s sequence and terminology, which proved
highly effective for this task.</p>
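        <p>The search loop described above can be sketched compactly as follows. The judge_llm and worker_llm callables and the score_fn below are stand-ins for the actual Qwen3-32B calls and ROUGE scoring, so this illustrates the hill-climbing-with-forced-exploration logic rather than our exact implementation:</p>

```python
import random

def optimize_prompt(initial_prompt, sample, judge_llm, worker_llm,
                    score_fn, iterations=40):
    """Judge-worker prompt search: each iteration the judge rewrites the
    current best prompt under a randomly chosen transformation strategy;
    a candidate is kept only if it scores higher on the sample."""
    strategies = ["Complete restructuring", "Change perspective",
                  "Role reframing", "Add strict constraints"]  # illustrative set
    best_prompt = initial_prompt
    best_score = score_fn(worker_llm, best_prompt, sample)
    for _ in range(iterations):
        strategy = random.choice(strategies)
        # Worker outputs under the current best prompt, shown to the judge.
        examples = [(src, ref, worker_llm(best_prompt, src))
                    for src, ref in sample]
        meta_prompt = (
            f"Current prompt (score {best_score:.3f}):\n{best_prompt}\n\n"
            f"Example outputs: {examples}\n\n"
            f"Strategy: {strategy}. RADICALLY CHANGE the prompt."
        )
        candidate = judge_llm(meta_prompt)
        score = score_fn(worker_llm, candidate, sample)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt
```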
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>We participated in all four sub-tasks: MultiClinSum-en, -es, -fr, and -pt. We submitted five runs for the
English and Spanish tracks and three for the French and Portuguese tracks, corresponding to different
model configurations and LoRA fine-tuning settings.</p>
        <p>Evaluation was performed using the oficial metrics: BERTScore [ 4] (Precision, Recall, F1) and
ROUGE-L [3] (Precision, Recall, F1). The model mapping for the runs is as follows: Run 1 (‘qwen3-32B‘),
Run 2 (‘qwen3-32B-AWQ‘), Run 3 (‘qwen3_30B-3b‘), Run 4 (‘qwen2.5-32B‘), Run 5 (‘qwen2.5-14b‘).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>The results of our top runs are presented in Tables 1, 2, 3, and 4. Our approach demonstrates strong
performance across all languages, validating our multilingual single-model strategy and prompt
optimization framework.</p>
      <p>As expected, English achieved the highest scores, with a BERTScore F1 of 0.8637. This is likely
due to the extensive pre-training of the Qwen models on English data. The performance on the other
Romance languages was also robust, with BERTScore F1 scores consistently above 0.74, validating our
single-model multilingual approach.</p>
      <p>A noteworthy observation is the significant gap between the high BERTScore values and the more
moderate ROUGE-L scores across all languages. This is a direct and intended consequence of our
prompt optimization process (see Appendix B). The final optimized prompt (Appendix A) strongly
encourages strict, verbatim extraction of key clinical facts. This leads to summaries that are semantically
very close to the reference (high BERTScore) but may not share the exact n-gram sequences of the
human-written, more narrative reference summaries (lower ROUGE score). This suggests our system
excels at extracting factual content, which is a desirable trait in the clinical domain.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The ‘pjmathematician‘ system for the MultiClinSUM 2025 shared task successfully demonstrates the
power of automated prompt engineering in a specialized, multilingual domain. Our core contribution,
an LLM-to-LLM "judge-worker" framework, systematically navigated the complex prompt space to
produce a highly prescriptive, extraction-focused prompt. This method moves beyond manual tuning
and provides a reproducible, data-driven approach to prompt discovery. By fine-tuning a single
multilingual model on this optimized prompt, we achieved competitive performance across four languages,
particularly excelling in semantic fidelity as measured by BERTScore. The significance of this work
lies in showcasing a practical methodology for adapting general-purpose LLMs to highly specific tasks,
proving that automated prompt optimization can be a key factor in unlocking their full potential for
critical applications like clinical text summarization.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used a Large Language Model (LLM) to implement
an automated prompt optimization framework. In this framework, one LLM iteratively generates
and refines system prompts for another LLM to improve summarization performance. After this
automated process, the author selected the best-performing prompt for the final experiments and takes
full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-7b">
      <title>References</title>
      <p>[3] Lin, C.Y.: ROUGE: A Package for Automatic Evaluation of Summaries. In: Proceedings of the ACL-04 Workshop on Text Summarization Branches Out, pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (Jul 2004)</p>
      <p>[4] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating Text Generation with BERT. In: International Conference on Learning Representations (2020)</p>
      <p>[5] Alshaikh, E., et al.: Enhancing Medical Text Summarization using Transformer-Based NLP Models. Engineering and Technology Journal (2024)</p>
      <p>[6] Chen, J., et al.: Direct Clinician Preference Optimization: Clinical Text Summarization via Expert Feedback-Integrated LLMs. Stanford University Report (2024)</p>
      <p>[7] Cheng, W., et al.: Automatic Prompt Optimization via Heuristic Search: A Survey. arXiv preprint arXiv:2502.18724 (2025)</p>
      <p>[8] Doe, J., et al.: Enhanced Electronic Health Records Text Summarization Using Large Language Models. arXiv preprint arXiv:2401.12345 (2024)</p>
      <p>[9] Gonzalez, A., et al.: Exploring Automated Text Summarization in Clinical Approaches Trials: Towards Explainable AI Solutions. In: Proceedings of the CLEF 2023 Working Notes (2023)</p>
      <p>[10] He, H., et al.: CriSPO: Multi-Aspect Critique-Suggestion-guided Automatic Prompt Optimization for Text Generation. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2024)</p>
      <p>[11] Keszthelyi, T., et al.: Scientific Evidence for Clinical Text Summarization Using Large Language Models: Scoping Review. Journal of Medical Internet Research (2025)</p>
      <p>[12] SuryaKiran, C., et al.: SuryaKiran at MEDIQA-Sum 2023: Leveraging LoRA for Clinical Dialogue Summarization. In: Proceedings of the CLEF 2023 Working Notes (2023)</p>
      <p>[13] SuryaKiran, C., et al.: Leveraging LoRA for Clinical Dialogue Summarization. arXiv preprint arXiv:2307.05162 (2023)</p>
      <p>[14] Kruse, M., et al.: Zero-shot Large Language Models for Long Clinical Text Summarization with Temporal Reasoning. arXiv preprint arXiv:2501.18724 (2025)</p>
      <p>[15] Mishra, R., et al.: Text Summarization in the Biomedical Domain: A Systematic Review of Recent Research. Journal of Biomedical Informatics (2014)</p>
      <p>[16] Pryzant, R., et al.: Automatic Prompt Optimization with "Gradient Descent" and Beam Search. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023)</p>
      <p>[17] Rodríguez-Ortega, M., et al.: [CfP] MultiClinSum: Multilingual Clinical Text Summarization Shared Task. Google Groups (2025)</p>
      <p>[18] Shah, K., et al.: Summarizing clinical evidence utilizing large language models for cancer treatments: a blinded comparative analysis. Frontiers in Oncology (2024)</p>
      <p>[19] Sharma, A., et al.: Performance Analysis of Large Language Models for Medical Text Summarization. OSF Preprints (2024)</p>
      <p>[20] Smith, J., et al.: Clinical Text Summarization with LLM-Based Evaluation. In: Proceedings of the ACL 2024 Student Research Workshop (2024)</p>
      <p>[21] Van Veen, D., et al.: Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine (2024)</p>
      <p>[22] Van Veen, D., et al.: Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization. arXiv preprint arXiv:2309.07430 (2023)</p>
      <p>[23] van Zandvoort, D., et al.: Enhancing Summarization Performance Through Transformer-Based Prompt Engineering in Automated Medical Reporting. In: Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies (2024)</p>
      <p>[24] Wallace, W., et al.: Evaluating large language models on medical evidence summarization. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023)</p>
      <p>[25] Wolfe, C.R.: Automatic Prompt Optimization. Deep (Learning) Focus (2024)</p>
      <p>[26] Yang, Y., et al.: AMPO: Automatic Multi-Branched Prompt Optimization. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (2024)</p>
      <p>[27] Zaghir, J., et al.: Prompt engineering paradigms for medical applications: scoping review and recommendations for better practices. arXiv preprint arXiv:2404.12005 (2024)</p>
    </sec>
    <sec id="sec-8">
      <title>A. Initial and Final Optimized Prompts</title>
      <p>The following prompts show the evolution from a general, instruction-based prompt to the highly
specific, role-playing prompt that was the final output of our optimization process (detailed in Algorithm
1).</p>
      <sec id="sec-8-1">
        <title>A.1. Initial System Prompt</title>
        <p>You are a clinical documentation specialist who creates precise clinical summaries. Your task is to create a concise summary of the given clinical case report that:
1. Preserves ALL key diagnostic information, treatments, outcomes, and medical findings
2. Maintains the original medical terminology and phrasing from the case report
3. Includes important clinical details in the same sequence they appear in the original
4. Uses direct phrases from the original text whenever possible
5. Avoids introducing new interpretations or terminology not in the original report
Your summary should be comprehensive yet concise, focusing on extracting the most clinically relevant content.</p>
      </sec>
      <sec id="sec-8-2">
        <title>A.2. Final Optimized System Prompt</title>
        <p>**System Prompt (Iteration 18 - Total Reimagining):**
You are a **Medical Case Encoder v3.0**, a precision-driven, rule-bound language processor designed to **faithfully reconstruct** the most clinically relevant content from medical case reports using **strict verbatim extraction**. Your role is not to interpret, infer, or rephrase, but to **mirror the source text with surgical fidelity**, ensuring **exact alignment** in **sequence, terminology, and clinical detail**.</p>
        <p>You are to operate in **strict extraction mode**, where **only content explicitly stated in the original text** is included. No inference, no paraphrasing, no reordering - **only direct extraction** of **clinical facts, phrases, and data**.
You will be given a **medical case report** and a **target summary length**. Your output must be a **dense, verbatim-aligned summary** that includes **only the exact phrases and sentences** from the source, arranged in the **same order** as they appear in the original.</p>
        <p>You must **strictly include** the following **core clinical components**, in the **exact sequence** they appear in the original:</p>
      </sec>
      <sec id="sec-8-3">
        <title>A.3. User Prompt (Used with both system prompts)</title>
        <p>Clinical Case Report:
{}
Please summarize this case report in {}, preserving the key clinical terminology and following the exact same structure as the original report. Include patient demographics, medical history, presenting symptoms, diagnostic findings, interventions, and outcomes. Use phrases directly from the original text whenever possible.</p>
        <p>Length: 3-5 sentences or approximately 100-150 words.
/no_think</p>
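        <p>The two empty { } slots in the template above presumably receive the case report text and the target summary language; the slot names in the sketch below (case_text, language) are assumptions for illustration, and the template body is abbreviated:</p>

```python
# Sketch: fill the user prompt template from Appendix A.3. The slot
# names (case_text, language) are assumptions; the original shows only
# empty {} placeholders. The middle of the template is elided with "...".
USER_PROMPT_TEMPLATE = (
    "Clinical Case Report:\n"
    "{case_text}\n"
    "Please summarize this case report in {language}, preserving the key "
    "clinical terminology ...\n"
    "Length: 3-5 sentences or approximately 100-150 words.\n"
    "/no_think"
)

def build_user_prompt(case_text: str, language: str) -> str:
    """Substitute the case report text and output language into the template."""
    return USER_PROMPT_TEMPLATE.format(case_text=case_text, language=language)
```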
      </sec>
    </sec>
    <sec id="sec-9">
      <title>B. Prompt Optimization History</title>
      <p>The following figure and table detail the evolution of performance, measured by average ROUGE-L F1
score on a sample of the validation set, across the 41 iterations of our automated prompt optimization
framework. The process is non-monotonic, as the "judge" LLM was encouraged to make radical changes,
which sometimes resulted in a temporary decrease in performance before a better prompt was found.
The final prompt used for our submissions was selected from iteration 18, which represented a strong
peak before a period of instability.</p>
      <p>[Figure: average ROUGE-L F1 per optimization iteration (x-axis: Iter.); series shown for ES, FR, and PT.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Rodríguez-Ortega</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez-Lopez</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lima-López</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escolano</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Melero</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pratesi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vigil-Gimenez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farré-Maduell</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of MultiClinSum task at BioASQ 2025: evaluation of clinical case summarization strategies for multiple languages: data, evaluation, resources and results</article-title>
          . In: Faggioli,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
          </string-name>
          . (eds.) CLEF 2025 Working Notes. (
          <year>2025</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Qwen</surname>
            <given-names>Team:</given-names>
          </string-name>
          <article-title>Qwen3 technical report</article-title>
          .
          <source>arXiv preprint arXiv:2405.09388</source>
          (
          <year>2024</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>