<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>pjmathematician at MultiClinSUM 2025: A Novel Automated Prompt Optimization Framework for Multilingual Clinical Summarization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Poojan Vachharajani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Netaji Subhas University of Technology</institution>
          ,
          <addr-line>New Delhi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper describes the 'pjmathematician' team's submission to the MultiClinSUM 2025 shared task, focusing on multilingual summarization of clinical case reports in English, Spanish, French, and Portuguese. Our approach leverages fine-tuned Large Language Models (LLMs) from the Qwen family, adapted using Low-Rank Adaptation (LoRA). The core of our methodology is a novel, automated prompt optimization framework where a "judge" LLM iteratively refines the system prompt for a "worker" LLM to maximize summarization quality, measured by ROUGE scores. This process resulted in a highly-specific, extraction-focused prompt that instructs the model to mirror the source text's terminology and structure with high fidelity. We submitted multiple runs using different model configurations, trained exclusively on the provided gold-standard dataset. Our results demonstrate the effectiveness of this automated prompt engineering strategy, achieving competitive scores across all four languages, with BERTScore F1 reaching up to 0.864 in English.</p>
      </abstract>
      <kwd-group>
        <kwd>Clinical Summarization</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>Automated Prompt Optimization</kwd>
        <kwd>LoRA</kwd>
        <kwd>MultiClinSUM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The automated summarization of clinical text is a long-standing challenge in natural language processing,
driven by the need to condense vast amounts of clinical data from sources like electronic health records
and medical literature to support clinical decision-making [15]. Early approaches were often extractive,
relying on methods like TextRank. However, the advent of deep learning and transformer-based
architectures has led to significant progress, with models like BERT and T5 being adapted for the clinical
domain [5].</p>
      <p>Recent research has heavily focused on the application of Large Language Models (LLMs) to this
problem, demonstrating their potential to generate high-quality, coherent summaries [24, 18]. Studies
have shown that with appropriate adaptation, such as fine-tuning, LLMs can produce summaries of
clinical texts that are comparable or even superior to those written by medical experts [21]. This has
been explored across a variety of clinical documents, including radiology reports, progress notes, and
doctor-patient dialogues [12]. A significant portion of this research has been conducted on
English-language data, often using datasets like MIMIC-IV [8]. While multilingual summarization is a recognized
goal, as evidenced by the MultiClinSUM shared task itself [17], dedicated studies in this area remain
less common.</p>
      <p>A critical aspect of leveraging LLMs is prompt engineering, which has been shown to significantly
influence model performance [27, 23]. The process of designing effective prompts is crucial in specialized
domains like medicine, which has its own unique terminology and structure. Our work aligns with a
growing body of research that seeks to move beyond manual prompt crafting towards more systematic
and automated methods [7]. This includes techniques where an LLM itself is used to refine prompts.
For instance, Pryzant et al. (2023) proposed a method using an LLM’s feedback to generate "textual
gradients" to iteratively improve a prompt [16]. Similarly, other optimization frameworks use an LLM
to generate new prompts based on the performance of previous ones [10, 26, 6]. Our "judge-worker"
framework is a novel contribution to this area of automated prompt optimization, specifically tailored
for the complexities of multilingual clinical summarization.</p>
      <p>Furthermore, our use of Low-Rank Adaptation (LoRA) for eficient fine-tuning is consistent with
current best practices for adapting large models to specialized tasks. LoRA has been successfully applied
in the clinical domain to improve performance on tasks like clinical dialogue summarization without
the prohibitive costs of full fine-tuning [ 12, 13]. Studies have shown that models fine-tuned with
LoRA on domain-specific data can achieve strong results, validating our choice of this technique. Our
approach of integrating the optimized prompt directly into the LoRA training process is a key aspect of
our methodology, ensuring the model is specifically adapted to the desired extractive and structured
summarization style.</p>
    </sec>
    <sec id="sec-3">
      <title>3. System Description</title>
      <p>Our methodology is built upon three key components: the training data, the model architecture and
training, and our automated prompt optimization framework.</p>
      <sec id="sec-3-0">
        <title>3.1. Data</title>
        <p>
          The MultiClinSUM task provides two types of training data: a "gold-standard" (GS) set and a "large-scale"
set [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. For all our experiments, we exclusively used the gold-standard datasets. These datasets consist
of 592 full-text clinical case reports and their corresponding author-written summaries for each of
the four languages (English, Spanish, French, and Portuguese). We opted for the GS data to focus our
efforts on high-quality, curated examples, believing this would be more effective for fine-tuning with
our advanced prompting strategy. No other external data sources were used.
        </p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Model Architecture and Training</title>
        <p>
          Our systems are based on models from the Qwen family [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], a series of powerful open-source LLMs.
For each base model configuration (e.g., ‘qwen3-32B‘), we performed fine-tuning using Low-Rank
Adaptation (LoRA). A key decision in our approach was to use a single, multilingual model rather than
training a separate model for each language. All 592 × 4 document-summary pairs were combined into
a single training set.
        </p>
        <p>A crucial aspect of our training strategy was the integration of our final optimized prompt (see
Appendix A) directly into the training data. For each instance, the input was formatted as a conversation
with the optimized system prompt, followed by the user prompt containing the full-text clinical case
report. The target output was the corresponding reference summary. This ensures that the LoRA
fine-tuning process adapts the model to respond optimally to the specific instructions discovered during
our optimization phase.</p>
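        <p>As a minimal sketch of this formatting step (the field names, corpus layout, and abbreviated prompt strings below are illustrative assumptions, not the exact artifacts used in our runs), each gold-standard pair can be wrapped as a system/user/assistant conversation and the four languages pooled into one training set:</p>

```python
# Sketch: wrap each gold-standard document-summary pair as a chat-style
# training instance. OPTIMIZED_SYSTEM_PROMPT abbreviates the prompt in
# Appendix A.2; the record layout is an illustrative assumption.
OPTIMIZED_SYSTEM_PROMPT = "You are a **Medical Case Encoder v3.0** ..."

def to_chat_instance(case_text: str, reference_summary: str) -> dict:
    """One training example: optimized system prompt, case report as the
    user turn, reference summary as the target assistant turn."""
    return {
        "messages": [
            {"role": "system", "content": OPTIMIZED_SYSTEM_PROMPT},
            {"role": "user", "content": "Clinical Case Report:\n" + case_text},
            {"role": "assistant", "content": reference_summary},
        ]
    }

# All four languages are pooled into a single multilingual training set
# (592 pairs per language in the real data; one toy pair each here).
corpus = {
    "en": [("case text ...", "summary ...")],
    "es": [("texto del caso ...", "resumen ...")],
    "fr": [("texte du cas ...", "résumé ...")],
    "pt": [("texto do caso ...", "resumo ...")],
}
train_set = [to_chat_instance(doc, ref)
             for pairs in corpus.values()
             for doc, ref in pairs]
```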
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Automated Prompt Optimization</title>
        <p>The cornerstone of our approach is an automated framework for discovering an optimal system prompt,
thereby reducing the manual effort and bias inherent in traditional prompt engineering. We designed
an algorithm where a "judge" LLM iteratively refines the prompt for a "worker" LLM (both LLMs were
Qwen3-32B). This process, detailed in Algorithm 1, systematically explores the vast space of possible
instructions to find a prompt that elicits the best summarization performance on a validation sample.</p>
        <p>Algorithm 1: Automated Prompt Optimization Framework
1: Input: Initial prompt P_0, sample dataset D, judge LLM, worker LLM, user prompt template, iterations N.
2: Output: Best performing prompt P*.
3: Initialize P* ← P_0.
4: Evaluate P* on D to get initial score s*.
5: for i = 1 to N do
6:   Select a transformation strategy (e.g., "Complete restructuring", "Change perspective").
7:   Generate examples E of source texts, reference summaries, and summaries produced with P*.
8:   Construct a meta-prompt for the Judge LLM, including P*, s*, the examples E, and the transformation strategy; instruct the Judge LLM to create a radically different prompt.
9:   P_i ← JudgeLLM(meta-prompt).
10:  Evaluate P_i on D to get new score s_i.
11:  if s_i &gt; s* then
12:    P* ← P_i; s* ← s_i.
13:  end if
14: end for
15: return P*.</p>
        <p>This "judge-worker" paradigm forces exploration. In each iteration, the judge LLM is instructed to
make radical, non-incremental changes to the prompt, guided by a set of "transformation strategies"
(e.g., "Complete restructuring," "Change perspective"). The judge is provided with the current prompt,
its performance score, examples of summaries it produces, and an analysis of weaknesses in the
output. Based on this, it generates a completely new set of instructions, often being explicitly told to
’RADICALLY CHANGE the prompt’ to avoid minor local-optima tweaks. This iterative refinement
continued for 40 cycles, after which the highest-scoring prompt was selected (see Appendix B). The
final prompt, detailed in Appendix A, evolved to be highly structured and prescriptive, emphasizing
verbatim extraction and strict adherence to the source text’s sequence and terminology, which proved
highly effective for this task.</p>
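        <p>The search loop described above can be sketched compactly as follows. The judge_llm and worker_llm callables and the score_fn below are stand-ins for the actual Qwen3-32B calls and ROUGE scoring, so this illustrates the hill-climbing-with-forced-exploration logic rather than our exact implementation:</p>

```python
import random

def optimize_prompt(initial_prompt, sample, judge_llm, worker_llm,
                    score_fn, iterations=40):
    """Judge-worker prompt search: each iteration the judge rewrites the
    current best prompt under a randomly chosen transformation strategy;
    a candidate is kept only if it scores higher on the sample."""
    strategies = ["Complete restructuring", "Change perspective",
                  "Role reframing", "Add strict constraints"]  # illustrative set
    best_prompt = initial_prompt
    best_score = score_fn(worker_llm, best_prompt, sample)
    for _ in range(iterations):
        strategy = random.choice(strategies)
        # Worker outputs under the current best prompt, shown to the judge.
        examples = [(src, ref, worker_llm(best_prompt, src))
                    for src, ref in sample]
        meta_prompt = (
            f"Current prompt (score {best_score:.3f}):\n{best_prompt}\n\n"
            f"Example outputs: {examples}\n\n"
            f"Strategy: {strategy}. RADICALLY CHANGE the prompt."
        )
        candidate = judge_llm(meta_prompt)
        score = score_fn(worker_llm, candidate, sample)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt
```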
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>We participated in all four sub-tasks: MultiClinSum-en, -es, -fr, and -pt. We submitted five runs for the
English and Spanish tracks and three for the French and Portuguese tracks, corresponding to different
model configurations and LoRA fine-tuning settings.</p>
        <p>Evaluation was performed using the oficial metrics: BERTScore [ 4] (Precision, Recall, F1) and
ROUGE-L [3] (Precision, Recall, F1). The model mapping for the runs is as follows: Run 1 (‘qwen3-32B‘),
Run 2 (‘qwen3-32B-AWQ‘), Run 3 (‘qwen3_30B-3b‘), Run 4 (‘qwen2.5-32B‘), Run 5 (‘qwen2.5-14b‘).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>The results of our top runs are presented in Tables 1, 2, 3, and 4. Our approach demonstrates strong
performance across all languages, validating our multilingual single-model strategy and prompt
optimization framework.</p>
      <p>As expected, English achieved the highest scores, with a BERTScore F1 of 0.8637. This is likely
due to the extensive pre-training of the Qwen models on English data. The performance on the other
Romance languages was also robust, with BERTScore F1 scores consistently above 0.74, validating our
single-model multilingual approach.</p>
      <p>A noteworthy observation is the significant gap between the high BERTScore values and the more
moderate ROUGE-L scores across all languages. This is a direct and intended consequence of our
prompt optimization process (see Appendix B). The final optimized prompt (Appendix A) strongly
encourages strict, verbatim extraction of key clinical facts. This leads to summaries that are semantically
very close to the reference (high BERTScore) but may not share the exact n-gram sequences of the
human-written, more narrative reference summaries (lower ROUGE score). This suggests our system
excels at extracting factual content, which is a desirable trait in the clinical domain.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The ‘pjmathematician‘ system for the MultiClinSUM 2025 shared task successfully demonstrates the
power of automated prompt engineering in a specialized, multilingual domain. Our core contribution,
an LLM-to-LLM "judge-worker" framework, systematically navigated the complex prompt space to
produce a highly prescriptive, extraction-focused prompt. This method moves beyond manual tuning
and provides a reproducible, data-driven approach to prompt discovery. By fine-tuning a single
multilingual model on this optimized prompt, we achieved competitive performance across four languages,
particularly excelling in semantic fidelity as measured by BERTScore. The significance of this work
lies in showcasing a practical methodology for adapting general-purpose LLMs to highly specific tasks,
proving that automated prompt optimization can be a key factor in unlocking their full potential for
critical applications like clinical text summarization.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used a Large Language Model (LLM) to implement
an automated prompt optimization framework. In this framework, one LLM iteratively generates
and refines system prompts for another LLM to improve summarization performance. After this
automated process, the author selected the best-performing prompt for the final experiments and takes
full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-7b">
      <title>References</title>
      <p>[3] Lin, C.Y.: ROUGE: A Package for Automatic Evaluation of Summaries. In: Proceedings of the ACL-04 Workshop on Text Summarization Branches Out, pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (Jul 2004)</p>
      <p>[4] Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating Text Generation with BERT. In: International Conference on Learning Representations (2020)</p>
      <p>[5] Alshaikh, E., et al.: Enhancing Medical Text Summarization using Transformer-Based NLP Models. Engineering and Technology Journal (2024)</p>
      <p>[6] Chen, J., et al.: Direct Clinician Preference Optimization: Clinical Text Summarization via Expert Feedback-Integrated LLMs. Stanford University Report (2024)</p>
      <p>[7] Cheng, W., et al.: Automatic Prompt Optimization via Heuristic Search: A Survey. arXiv preprint arXiv:2502.18724 (2025)</p>
      <p>[8] Doe, J., et al.: Enhanced Electronic Health Records Text Summarization Using Large Language Models. arXiv preprint arXiv:2401.12345 (2024)</p>
      <p>[9] Gonzalez, A., et al.: Exploring Automated Text Summarization in Clinical Approaches Trials: Towards Explainable AI Solutions. In: Proceedings of the CLEF 2023 Working Notes (2023)</p>
      <p>[10] He, H., et al.: CriSPO: Multi-Aspect Critique-Suggestion-guided Automatic Prompt Optimization for Text Generation. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2024)</p>
      <p>[11] Keszthelyi, T., et al.: Scientific Evidence for Clinical Text Summarization Using Large Language Models: Scoping Review. Journal of Medical Internet Research (2025)</p>
      <p>[12] SuryaKiran, C., et al.: SuryaKiran at MEDIQA-Sum 2023: Leveraging LoRA for Clinical Dialogue Summarization. In: Proceedings of the CLEF 2023 Working Notes (2023)</p>
      <p>[13] SuryaKiran, C., et al.: Leveraging LoRA for Clinical Dialogue Summarization. arXiv preprint arXiv:2307.05162 (2023)</p>
      <p>[14] Kruse, M., et al.: Zero-shot Large Language Models for Long Clinical Text Summarization with Temporal Reasoning. arXiv preprint arXiv:2501.18724 (2025)</p>
      <p>[15] Mishra, R., et al.: Text Summarization in the Biomedical Domain: A Systematic Review of Recent Research. Journal of Biomedical Informatics (2014)</p>
      <p>[16] Pryzant, R., et al.: Automatic Prompt Optimization with "Gradient Descent" and Beam Search. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023)</p>
      <p>[17] Rodríguez-Ortega, M., et al.: [CfP] MultiClinSum: Multilingual Clinical Text Summarization Shared Task. Google Groups (2025)</p>
      <p>[18] Shah, K., et al.: Summarizing clinical evidence utilizing large language models for cancer treatments: a blinded comparative analysis. Frontiers in Oncology (2024)</p>
      <p>[19] Sharma, A., et al.: Performance Analysis of Large Language Models for Medical Text Summarization. OSF Preprints (2024)</p>
      <p>[20] Smith, J., et al.: Clinical Text Summarization with LLM-Based Evaluation. In: Proceedings of the ACL 2024 Student Research Workshop (2024)</p>
      <p>[21] Van Veen, D., et al.: Adapted large language models can outperform medical experts in clinical text summarization. Nature Medicine (2024)</p>
      <p>[22] Van Veen, D., et al.: Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization. arXiv preprint arXiv:2309.07430 (2023)</p>
      <p>[23] van Zandvoort, D., et al.: Enhancing Summarization Performance Through Transformer-Based Prompt Engineering in Automated Medical Reporting. In: Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies (2024)</p>
      <p>[24] Wallace, W., et al.: Evaluating large language models on medical evidence summarization. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (2023)</p>
      <p>[25] Wolfe, C.R.: Automatic Prompt Optimization. Deep (Learning) Focus (2024)</p>
      <p>[26] Yang, Y., et al.: AMPO: Automatic Multi-Branched Prompt Optimization. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (2024)</p>
      <p>[27] Zaghir, J., et al.: Prompt engineering paradigms for medical applications: scoping review and recommendations for better practices. arXiv preprint arXiv:2404.12005 (2024)</p>
    </sec>
    <sec id="sec-8">
      <title>A. Initial and Final Optimized Prompts</title>
      <p>The following prompts show the evolution from a general, instruction-based prompt to the highly
specific, role-playing prompt that was the final output of our optimization process (detailed in Algorithm
1).</p>
      <sec id="sec-8-1">
        <title>A.1. Initial System Prompt</title>
        <p>You are a clinical documentation specialist who creates precise clinical summaries. Your task is to create a concise summary of the given clinical case report that:
1. Preserves ALL key diagnostic information, treatments, outcomes, and medical findings
2. Maintains the original medical terminology and phrasing from the case report
3. Includes important clinical details in the same sequence they appear in the original
4. Uses direct phrases from the original text whenever possible
5. Avoids introducing new interpretations or terminology not in the original report
Your summary should be comprehensive yet concise, focusing on extracting the most clinically relevant content.</p>
      </sec>
      <sec id="sec-8-2">
        <title>A.2. Final Optimized System Prompt</title>
        <p>**System Prompt (Iteration 18 - Total Reimagining):**
You are a **Medical Case Encoder v3.0**, a precision-driven, rule-bound language processor designed to **faithfully reconstruct** the most clinically relevant content from medical case reports using **strict verbatim extraction**. Your role is not to interpret, infer, or rephrase, but to **mirror the source text with surgical fidelity**, ensuring **exact alignment** in **sequence, terminology, and clinical detail**.</p>
        <p>You are to operate in **strict extraction mode**, where **only content explicitly stated in the original text** is included. No inference, no paraphrasing, no reordering - **only direct extraction** of **clinical facts, phrases, and data**.
You will be given a **medical case report** and a **target summary length**. Your output must be a **dense, verbatim-aligned summary** that includes **only the exact phrases and sentences** from the source, arranged in the **same order** as they appear in the original.</p>
        <p>You must **strictly include** the following **core clinical components**, in the **exact sequence** they appear in the original:</p>
      </sec>
      <sec id="sec-8-3">
        <title>A.3. User Prompt (Used with both system prompts)</title>
        <p>Clinical Case Report:
{}
Please summarize this case report in {}, preserving the key clinical terminology and following the exact same structure as the original report. Include patient demographics, medical history, presenting symptoms, diagnostic findings, interventions, and outcomes. Use phrases directly from the original text whenever possible.</p>
        <p>Length: 3-5 sentences or approximately 100-150 words.
/no_think</p>
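        <p>The two empty { } slots in the template above presumably receive the case report text and the target summary language; the slot names in the sketch below (case_text, language) are assumptions for illustration, and the template body is abbreviated:</p>

```python
# Sketch: fill the user prompt template from Appendix A.3. The slot
# names (case_text, language) are assumptions; the original shows only
# empty {} placeholders. The middle of the template is elided with "...".
USER_PROMPT_TEMPLATE = (
    "Clinical Case Report:\n"
    "{case_text}\n"
    "Please summarize this case report in {language}, preserving the key "
    "clinical terminology ...\n"
    "Length: 3-5 sentences or approximately 100-150 words.\n"
    "/no_think"
)

def build_user_prompt(case_text: str, language: str) -> str:
    """Substitute the case report text and output language into the template."""
    return USER_PROMPT_TEMPLATE.format(case_text=case_text, language=language)
```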
      </sec>
    </sec>
    <sec id="sec-9">
      <title>B. Prompt Optimization History</title>
      <p>The following figure and table detail the evolution of performance, measured by average ROUGE-L F1
score on a sample of the validation set, across the 41 iterations of our automated prompt optimization
framework. The process is non-monotonic, as the "judge" LLM was encouraged to make radical changes,
which sometimes resulted in a temporary decrease in performance before a better prompt was found.
The final prompt used for our submissions was selected from iteration 18, which represented a strong
peak before a period of instability.</p>
      <p>[Figure: average ROUGE-L F1 per optimization iteration (x-axis: Iter.); series shown for ES, FR, and PT.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Rodríguez-Ortega</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez-Lopez</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lima-López</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escolano</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Melero</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pratesi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vigil-Gimenez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farré-Maduell</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of MultiClinSum task at BioASQ 2025: evaluation of clinical case summarization strategies for multiple languages: data, evaluation, resources and results</article-title>
          . In: Faggioli,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
          </string-name>
          . (eds.) CLEF 2025 Working Notes. (
          <year>2025</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Qwen</surname>
            <given-names>Team:</given-names>
          </string-name>
          <article-title>Qwen3 technical report</article-title>
          .
          <source>arXiv preprint arXiv:2405.09388</source>
          (
          <year>2024</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>