<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>O. d. Souza, H. R. Tabosa, D. M. d. Oliveira, M. H. d. S. Oliveira, Um método de sumarização automática de textos através de dados estatísticos e processamento de linguagem natural, Informação &amp; Sociedade: Estudos</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1946-1836</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/CBMS52027.2021.00056</article-id>
      <title-group>
        <article-title>MedGemma-Sum-Pt: A Lightweight Model for Portuguese Clinical Summarization⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elisa Terumi Rubel Schneider</string-name>
          <email>elisa.rubel@pucpr.edu.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Henrique Schneider</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emerson Cabrera Paraiso</string-name>
          <email>paraiso@ppgia.pucpr.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alceu Souza Britto Jr</string-name>
          <email>alceu@ppgia.pucpr.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael Menelau Oliveira Cruz</string-name>
          <email>Rafael.Menelau-Cruz@etsmtl.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>École de Technologie Supérieure (ÉTS), University of Quebec</institution>
          ,
          <addr-line>1100 Notre-Dame Street West, Montreal</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Pontifícia Universidade Católica do Paraná (PUCPR)</institution>
          ,
          <addr-line>Rua Imaculada Conceição, 1155, Curitiba</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>27</volume>
      <issue>2017</issue>
      <fpage>474</fpage>
      <lpage>479</lpage>
      <abstract>
        <p>Automatic summarization of clinical case reports is a challenging yet crucial task to support healthcare professionals in rapidly extracting relevant patient information. The scarcity of large-scale domain-specific datasets and pretrained models, combined with the linguistic complexity of medical texts, makes clinical summarization particularly challenging for low-resource languages like Portuguese and motivates the need for efficient adaptation strategies. This paper presents the approach developed by the ÉTS-PUCPR team for the Portuguese subtask of MultiClinSum, a multilingual clinical summarization shared task. Our methodology explores (i) zero-shot prompting with general-purpose instruction-following models, and (ii) supervised fine-tuning using LoRA, a parameter-efficient fine-tuning technique, on the biomedical language model MedGemma. We compare these strategies to assess their effectiveness for clinical summarization in Portuguese. Our results demonstrated competitive performance in internal evaluations, particularly when compared to zero-shot baseline performances, showing strong semantic similarity with expert summaries as measured by BERTScore. Despite limitations such as a relatively small training dataset, our findings highlight the potential of fine-tuning domain-specific models under resource constraints for low-resource clinical summarization tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Large language model</kwd>
        <kwd>NLP</kwd>
        <kwd>summarization</kwd>
        <kwd>clinical cases</kwd>
        <kwd>fine-tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The increasing availability of multilingual clinical case reports opens new opportunities for developing
automatic summarization systems that can assist healthcare professionals in efficiently accessing critical
patient information. However, clinical summarization remains a challenging task. Medical texts are
complex, sensitive, and often written in highly variable formats. These challenges are amplified in
languages with fewer resources, such as Portuguese, where annotated datasets and domain-adapted
models are still scarce [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] .
      </p>
      <p>
        The MultiClinSum task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], part of the BioASQ Lab at CLEF 2025 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], addresses these challenges by
evaluating systems for automatic summarization of clinical case reports in multiple languages. The
task encourages the development of models capable of generating concise, coherent, and medically
meaningful summaries across linguistic contexts.
      </p>
      <p>In this paper, we present the approach developed by the ÉTS-PUCPR team for the Portuguese subtask
of the MultiClinSum challenge. We explored two strategies: (i) zero-shot prompting using
general-purpose instruction-following models, and (ii) supervised fine-tuning on the clinical summarization data
released by the challenge organizers. For the latter, we adopt a parameter-efficient fine-tuning approach
using LoRA on top of MedGemma [5], a multilingual biomedical language model. All experiments are
conducted in a resource-constrained environment to reflect realistic deployment conditions in clinical
or academic settings.</p>
      <p>
        This work introduces MedGemma-Sum-Pt, a domain-adapted model for summarization in a
low-resource language. Our results align with previous studies showing that fine-tuning on in-domain data,
such as clinical narratives in Portuguese, tends to outperform general-purpose models due to linguistic
complexity and domain-specific terminology [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [6] [7].
      </p>
      <p>To foster further research, we release our fine-tuned model, addressing the current gap of
summarization models for clinical Portuguese:
https://huggingface.co/pucpr-br/medgemma-pt-finetunedmulticlinsum.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <p>
        The MultiClinSum task, introduced as part of the BioASQ Lab at CLEF 2025, aims to evaluate automatic
summarization systems applied to multilingual clinical case reports. Participants are required to generate
concise summaries from real-world clinical narratives written in different languages, including English,
Spanish, French, and Portuguese [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The task is framed as a summarization problem, where systems must produce short, fluent, and
medically accurate summaries that capture the essential information from the original clinical text.
Each instance consists of a clinical case report as input and a reference summary manually written by
domain experts.</p>
      <p>The dataset released for the MultiClinSum task, covering all languages, is divided into three subsets:</p>
      <p>• A gold-standard set, comprising 592 clinical case reports with summaries manually written and reviewed by medical experts, which serve as high-quality ground truth for training and evaluation;
• A large-scale set, containing 25,902 clinical cases also accompanied by summaries. However, unlike the gold set, these summaries are not expert-verified and may vary in quality;
• A test set with 3,442 clinical cases, used for the official evaluation of system performance. The summaries in this set are withheld from participants to ensure unbiased assessment.</p>
      <p>Evaluation is conducted using a combination of automatic metrics, such as ROUGE and BERTScore.</p>
      <p>ROUGE [8] measures the lexical overlap between the generated and reference texts, providing an estimate of informativeness based on shared n-grams. It is defined as follows:</p>
      <p>\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \{\mathrm{ReferenceSummaries}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)}{\sum_{S \in \{\mathrm{ReferenceSummaries}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}(\mathrm{gram}_n)} \quad (1)</p>
      <p>where n is the length of the n-gram, \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n) is the maximum number of n-grams co-occurring in both the candidate summary and the reference summaries, and \mathrm{Count}(\mathrm{gram}_n) is the total number of n-grams in the reference summaries.</p>
      <p>BERTScore [9] evaluates semantic similarity by comparing contextualized token embeddings from a pre-trained transformer model. Given a candidate summary \hat{x} = \{\hat{x}_1, \ldots, \hat{x}_k\} and a reference summary x = \{x_1, \ldots, x_m\}, we obtain corresponding normalized embeddings \{\hat{e}_j\} and \{e_i\}, respectively. BERTScore computes token-level cosine similarity using the inner product of embeddings, and derives precision, recall, and F1 as follows:</p>
      <p>P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} e_i^{\top} \hat{e}_j \quad (2)</p>
      <p>R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} e_i^{\top} \hat{e}_j \quad (3)</p>
      <p>\mathrm{BERTScore}_{F1} = \frac{2 \times P_{\mathrm{BERT}} \times R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}} \quad (4)</p>
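      <p>Equations (1)–(4) can be illustrated with a minimal sketch. This is not the official evaluation code: the whitespace tokenizer and the identity embeddings in the test are illustrative stand-ins for a real tokenizer and a transformer encoder.</p>

```python
from collections import Counter

import numpy as np


def ngrams(tokens, n):
    """All n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(candidate, references, n=1):
    """ROUGE-N as in Eq. (1): clipped n-gram matches over reference n-grams."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    match, total = 0, 0
    for ref in references:
        ref_counts = Counter(ngrams(ref.split(), n))
        # Count_match: n-gram co-occurrences, clipped by the candidate counts
        match += sum(min(cnt, cand_counts[g]) for g, cnt in ref_counts.items())
        total += sum(ref_counts.values())
    return match / total if total else 0.0


def bertscore_f1(cand_emb, ref_emb):
    """Eqs. (2)-(4) on pre-normalized token embeddings (one unit-norm row
    per token); greedy max-similarity matching in both directions."""
    sim = ref_emb @ cand_emb.T          # cosine similarities e_i^T e_hat_j
    p = sim.max(axis=0).mean()          # precision: best match per candidate token
    r = sim.max(axis=1).mean()          # recall: best match per reference token
    return 2 * p * r / (p + r)
```

      <p>For example, rouge_n("o paciente apresentou febre", ["o paciente apresentou febre alta"], n=1) yields 4/5 = 0.8, since four of the five reference unigrams also occur in the candidate.</p>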
      <p>The multilingual nature of the task also encourages research on cross-lingual and language-specific
modeling approaches in the clinical domain.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Related Work</title>
      <p>
        Few studies have explored LLM-based summarization of clinical texts in Portuguese, a low-resource
language that presents several unique challenges. The scarcity of publicly available clinical corpora,
complex morphology, frequent use of jargon, acronyms, and institution-specific abbreviations, as well
as dialectal variation, all hinder the development of robust NLP tools for tasks such as summarization
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [10]. Despite the strong performance of large language models in condensing lengthy medical
texts while preserving clinically relevant information, their effectiveness in Portuguese clinical texts
remains largely underexplored.
      </p>
      <p>Moreover, Portuguese exhibits rich morphology (e.g., gender and number inflections, dropped
pronouns), requiring careful natural language generation. Regional variations between European and
Brazilian Portuguese further affect vocabulary and syntax. Despite these challenges, initial efforts
have shown that specialized models for clinical Portuguese achieve competitive results, indicating the
potential of AI solutions for clinical NLP in Portuguese.</p>
      <p>A study [6] investigated abstractive summarization for Brazilian Portuguese using deep learning-based approaches. While their results were promising, the study highlighted ongoing challenges related
to coherence, grammar, and fluency. Although their work constitutes a preliminary exploration of
abstractive summarization in Brazilian Portuguese using neural models, it did not address
domain-specific applications such as clinical narratives. Other studies have also highlighted summarization
research for Portuguese texts, such as [11] and [12], though these works focus on general-domain texts
rather than clinical narratives.</p>
      <p>
        In the clinical domain, a study on summarization for Brazilian Portuguese [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] explored different
approaches applied to electronic health records of chronic disease patients, including an unsupervised
neural model based on fine-tuned BERT, as well as supervised methods using sequence labeling and
dictionary-based techniques. To reduce redundancy in the generated summaries, a semantic similarity
method based on Siamese Neural Networks was employed. Results showed that supervised methods
achieved better performance, particularly in preserving clinically relevant information, highlighting
the importance of domain-specific resources for effective summarization in Portuguese.
      </p>
      <p>Another study on automatic text summarization in Portuguese [13] compared six algorithms,
including classical methods (e.g., Luhn), modern neural models (ChatGPT), and a custom-designed approach
called the Marques algorithm. The evaluation, conducted on a COVID-19-related document, revealed
that the Marques algorithm outperformed others in precision, coherence, cohesion, and processing time.
Although not focused on clinical data, this study highlights important considerations for
summarization in Portuguese, such as the benefits of domain-adapted models, the importance of comprehensive
evaluation (including human assessment), and the challenges of adapting summarization strategies to
different text types.</p>
      <p>
        Although several clinical domain language models have been trained in Portuguese, such as
MEDLLM-BR [14], BioBERTpt [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], CardioBERTpt [10], gpt2-bio-pt [7], and DepreBERTBR [15], our
investigation did not identify any large language model specifically designed or fine-tuned for the task of
clinical summarization in Portuguese.
      </p>
      <p>In this work, we address this gap by fine-tuning MedGemma with LoRA on Portuguese clinical
summarization data, demonstrating improved performance in generating semantically aligned summaries
under realistic resource constraints. Our approach enables efficient domain adaptation and contributes
to the community with a publicly available fine-tuned model for Portuguese clinical summarization.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>To address the Portuguese summarization subtask in MultiClinSum, we experimented with two
approaches:</p>
      <p>(1) Zero-shot prompting, where the model receives only the input text and an instruction, without
seeing any examples; and</p>
      <p>(2) Supervised fine-tuning, where the model is explicitly trained on a labeled dataset to optimize its
summarization performance.</p>
      <p>These strategies are illustrated in Figure 1 and form the basis of the experiments presented in this work.</p>
      <sec id="sec-4-1">
        <title>4.1. Zero-Shot Prompting</title>
        <p>Our initial experiments focused on evaluating the performance of various open-source language models
in a zero-shot setting, to assess their ability to summarize clinical case reports in Portuguese without
additional training.</p>
        <p>We selected five instruction-following models for this stage, all of which were accessed via the Unsloth repository [16] with 4-bit quantization to enable efficient inference on limited hardware. All selected models are multilingual and capable of understanding and generating text in Portuguese.</p>
        <p>• Gemma-7B-Instruct [17]: A general-purpose instruction-tuned version of the Gemma 7B model, optimized for dialogue and reasoning tasks.
• Llama-3.1-8B-Instruct [18]: An instruction-tuned variant of LLaMA 3.1 with 8 billion parameters, offering high performance on a wide range of reasoning tasks.
• Llama-3.2-3B-Instruct [18]: A lightweight version of LLaMA 3.2 (3B), chosen for its small footprint and fast inference capabilities.
• Medgemma-4B-Instruct [5]: A smaller, biomedical-focused variant of Gemma, tuned on clinical and health-related instructions.
• Qwen2.5-7B-Instruct [19]: A 7-billion-parameter instruction-tuned model based on the Qwen 2.5 architecture, designed for general-purpose reasoning and generation tasks with efficient performance.</p>
        <p>Few-shot prompting was initially considered but was discarded due to input truncation caused by the limited context window of the models.</p>
        <p>These models were selected based on a balance between size, instruction-following capabilities, and
accessibility. Our goal was to prioritize lightweight models (3B to 8B parameters) that could be easily
deployed on modest hardware, such as edge devices or memory-constrained servers.</p>
        <p>Small and mid-sized models typically demonstrate lower performance compared to large-scale
models across a wide range of evaluation metrics and benchmarks. In a recent comparison involving
30 benchmark tasks, GPT-4 outperformed lightweight models (with up to 8B parameters) in 26 cases,
based on metrics such as ROUGE, accuracy, and task-specific scores [20].</p>
        <p>While larger models might offer superior summarization quality, our focus was on finding efficient and
practical solutions suitable for researchers and users with limited computational resources, particularly
in clinical or academic settings where access to large-scale infrastructure may be constrained.</p>
        <p>Smaller models enable real-time summarization directly within healthcare facilities or on portable
devices, facilitating faster decision-making and improving patient care without requiring expensive or
complex infrastructure.</p>
        <p>We used the FastLanguageModel.from_pretrained interface from Unsloth to load each model with 4-bit quantization. The models were configured with a context window of 8192 tokens and a generation limit of 512 new tokens. This setup allowed efficient testing without requiring access to high-end GPUs.</p>
        <p>4.1.1. Prompt Design</p>
        <p>All models were tested using the same prompt, designed to reflect the structure and content expected in a professional clinical summary.</p>
        <p>Given the central role of instruction prompts, especially in zero-shot scenarios with general-purpose
models, we paid particular attention to the design and refinement of our prompt for this task. In the
absence of in-context examples, the prompt alone defines the structure, tone, and granularity of the
expected output.</p>
        <p>During early experiments, we tested multiple prompt formulations to evaluate their effect on model
behavior. Minimal or short prompts (e.g., “Summarize the following clinical case”) often produced
generic outputs lacking key clinical details. Moreover, without an explicit instruction to reduce the
input length, the models tended to generate overly long summaries. The requirement of a 75% reduction
in text length helped constrain verbosity and improved focus on relevant findings.</p>
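        <p>The length directives discussed above (a reduction of at least 75%, 5 to 10 sentences, 50 to 150 words) can be verified mechanically. A minimal sketch with a naive whitespace/punctuation tokenizer (an illustrative helper, not part of the official pipeline):</p>

```python
import re


def meets_constraints(source: str, summary: str) -> bool:
    """Check the prompt's length directives: at least a 75% reduction in
    word count, 5-10 sentences, and 50-150 words in the summary."""
    src_words = len(source.split())
    sum_words = len(summary.split())
    # Naive sentence split on ., ! or ? followed by optional whitespace
    sentences = [s for s in re.split(r"[.!?]+\s*", summary.strip()) if s]
    reduction = 1 - sum_words / src_words if src_words else 0.0
    return (reduction >= 0.75
            and 5 <= len(sentences) <= 10
            and 50 <= sum_words <= 150)
```

        <p>Such a check could be used to flag generations that ignore the length directives, one of the failure modes we observed with minimal prompts.</p>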
        <p>The selected prompt struck a balance between guidance and conciseness. It proved robust across different model architectures and sizes (e.g., 3B to 8B parameters), and consistently improved informativeness and coherence compared to shorter or less explicit alternatives. To emulate expert-authored case summaries, we also encouraged a chronologically coherent narrative that preserves the logical flow of clinical reasoning. We observed that models were particularly sensitive to the presence (or absence) of directives regarding summary length and content types.</p>
        <p>The prompt was written in Portuguese to ensure alignment with the language of the input cases:
Você é um especialista em medicina clínica. Leia o caso clínico e produza um resumo clínico técnico, objetivo,
conciso, claro e profissional com as seguintes características:
- Faça uma redução de pelo menos 75% do conteúdo original, mantendo a essência do caso. O texto gerado deve
ter entre 5 a 10 frases e entre 50 a 150 palavras.
- Identifique dados principais do paciente (idade, sexo, antecedentes importantes).
- Descreva os sintomas e sinais apresentados, com tempo de evolução.
- Resuma os exames relevantes e os achados principais.
- Aponte as hipóteses diagnósticas e diagnóstico final, se houver.
- Descreva a conduta realizada e a evolução do paciente.
- Escreva um parágrafo coeso e conciso contendo essas informações principais. Não escreva em tópicos, mas
em texto fluido.</p>
        <p>Escreva com clareza, precisão e fluidez técnica, como um profissional da área da saúde humana, em um texto
descritivo com narrativa cronológica simples.</p>
        <p>Siga as instruções e forneça o resumo no final.</p>
        <p>The English translation of the prompt is provided below for clarity and accessibility.</p>
        <p>You are a specialist in clinical medicine. Read the clinical case and produce a technical, objective, concise, clear,
and professional clinical summary with the following characteristics:
- Reduce the original content by at least 75%, keeping the essence of the case. The generated text should contain
between 5 to 10 sentences and between 50 to 150 words.
- Identify key patient data (age, sex, relevant medical history).
- Describe symptoms and signs presented, including time of onset/evolution.
- Summarize relevant exams and main findings.
- Indicate diagnostic hypotheses and final diagnosis, if available.
- Describe the treatment and the patient’s clinical outcome.
- Write a cohesive and concise paragraph containing this core information. Do not use bullet points, but rather
a fluid narrative.</p>
        <p>Write clearly, precisely, and with technical fluency, as a healthcare professional would, using a simple
chronological structure.</p>
        <p>Follow the instructions and provide the summary at the end.</p>
        <p>The final prompt explicitly instructs the model to produce concise, fluent, and structured summaries
emulating clinical documentation style. It specifies essential elements to include, such as patient
demographics, symptom evolution, diagnostic reasoning, relevant exams, and treatment outcomes. It
also imposes constraints on summary length and discourages bulleted or disjointed text. This design
was guided by two main goals: (i) ensuring the inclusion of clinically relevant information; and (ii)
enforcing a professional tone and format aligned with medical communication standards.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Supervised Fine-Tuning</title>
        <p>Building on insights from zero-shot evaluation, we fine-tuned MedGemma [5], a domain-adapted
language model specialized in clinical and biomedical text generation, to develop a new model named
MedGemma-Sum-Pt.</p>
        <p>The model was fine-tuned to better adapt to the task based on the provided training data and a fixed
prompt format. We used the same instruction prompt as in the zero-shot experiment, to ensure that the
model internalizes the expected output structure, reducing the risk of format mismatch or unpredictable
behavior during summarization. The fine-tuning was performed exclusively on the gold-standard
dataset provided by the MultiClinSum organizers, containing high-quality expert summaries essential
for training accurate clinical summarization models.</p>
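        <p>Each training instance pairs the fixed instruction prompt and a clinical case with its expert summary. A minimal sketch of how such records might be assembled into the single text field commonly consumed by SFT trainers (the section markers and EOS token are illustrative assumptions, not our exact format):</p>

```python
def build_sft_record(instruction: str, case_text: str, summary: str,
                     eos_token: str = "</s>") -> dict:
    """Concatenate instruction, input case, and target summary into one
    'text' field; 'prompt' alone is what the model sees at inference."""
    prompt = f"{instruction}\n\n### Caso clínico:\n{case_text}\n\n### Resumo:\n"
    return {"prompt": prompt, "text": prompt + summary + eos_token}


record = build_sft_record(
    "Você é um especialista em medicina clínica. Produza um resumo...",
    "Paciente de 45 anos, sexo masculino, com febre há 3 dias...",
    "Homem de 45 anos com quadro febril de 3 dias de evolução...",
)
```

        <p>Keeping the fine-tuning prompt identical to the zero-shot prompt is what lets the model internalize the expected output structure.</p>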
        <p>Our training strategy utilized a parameter-efficient fine-tuning (PEFT) technique, LoRA (Low-Rank Adaptation) [21], to efficiently update a subset of model parameters. Unlike traditional fine-tuning, which updates all model weights and requires substantial computational resources, LoRA introduces trainable low-rank matrices into specific layers (e.g., attention and language layers) while keeping the original model weights frozen. This significantly reduces the number of trainable parameters and accelerates training. As illustrated in Figure 2, this approach enables a compact and efficient adaptation process.</p>
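        <p>The idea can be sketched numerically: for a frozen weight matrix W of shape d×d, LoRA learns two small factors B (d×r) and A (r×d) so that the effective weight is W + (α/r)·BA, training 2·d·r parameters instead of d². A toy illustration with numpy (the shapes, zero-initialized B, and α/r scaling follow the LoRA paper; d, r, and α here are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 4096, 8, 16           # hidden size, LoRA rank, scaling factor

W = rng.normal(size=(d, d))         # frozen pretrained weight (never updated)
A = rng.normal(size=(r, d)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                # B starts at zero, so the update starts at 0


def effective_weight(W, B, A, alpha, r):
    """LoRA forward weight: W + (alpha / r) * B @ A."""
    return W + (alpha / r) * B @ A


full_params = d * d                 # parameters updated by full fine-tuning
lora_params = d * r + r * d         # parameters updated by LoRA
# With d=4096 and r=8: 16,777,216 vs 65,536 trainable parameters per matrix
```

        <p>Since B is initialized to zero, training begins exactly at the pretrained weights and gradually learns a low-rank correction.</p>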
        <p>We configured LoRA with a rank of r = 8, which balances parameter efficiency and expressiveness, as evidenced by [21]. Adaptation modules were injected into the query and value projection matrices within the attention layers of the transformer, which have been shown to be particularly effective for fine-tuning language models on downstream NLP tasks. Since the task is purely textual, all vision-related components of the architecture were disabled. Additionally, 4-bit quantization was employed to optimize memory usage during training, allowing training on modest hardware without significantly degrading model accuracy.</p>
        <p>Training was conducted using the SFTTrainer framework [22], with the following hyperparameters:
a per-device batch size of 4 with gradient accumulation over 4 steps, a learning rate of 2e-4, and
a total of 3 training epochs (375 optimization steps). To reduce memory consumption, we used the
adamw_8bit optimizer, combined with a linear learning rate scheduler and a brief warm-up phase of 5
steps. The entire training process took approximately 6.4 hours on a single NVIDIA T4 GPU (16 GB),
with a peak memory usage of around 11 GB. This demonstrates the feasibility of fine-tuning using
affordable, widely accessible hardware rather than relying on high-end infrastructure.</p>
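        <p>The hyperparameters reported above can be gathered into a trainer configuration. A hedged sketch as a plain dict whose field names mirror Hugging Face TrainingArguments conventions (the exact argument object we passed to SFTTrainer may differ):</p>

```python
# Hyperparameters reported in the text, as a plain configuration dict.
training_config = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "num_train_epochs": 3,
    "optim": "adamw_8bit",          # 8-bit AdamW to reduce optimizer memory
    "lr_scheduler_type": "linear",
    "warmup_steps": 5,
}

# Gradients are accumulated over 4 steps, so each optimizer update
# effectively sees 4 x 4 = 16 training examples.
effective_batch_size = (training_config["per_device_train_batch_size"]
                        * training_config["gradient_accumulation_steps"])
```
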
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Strategy</title>
        <p>To assess the quality of summaries produced by both zero-shot prompting and fine-tuned generation,
we set aside the first 50 examples from our 592-report gold-standard training set. These 50 cases were
excluded from fine-tuning and used solely as an internal validation set for comparative evaluation.</p>
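        <p>The held-out split described above is deterministic and can be sketched in a few lines (reports stands in for the list of (case, summary) pairs; the index-based slicing follows the "first 50 examples" description):</p>

```python
def split_gold(reports, n_val=50):
    """Hold out the first n_val examples as an internal validation set;
    the remainder is used for fine-tuning."""
    return reports[n_val:], reports[:n_val]


# With the 592 gold-standard reports, this yields 542 training examples
# and 50 validation examples.
reports = [(f"case-{i}", f"summary-{i}") for i in range(592)]
train, val = split_gold(reports)
```
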
        <p>
          We adopted the automatic evaluation metrics proposed by the MultiClinSum task organizers: ROUGE
and BERTScore, which provide surface-level and semantic similarity measurements, respectively. For
BERTScore, we used the biobertpt-all model [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], a Portuguese clinical and biomedical BERT model,
ensuring domain and language alignment with the evaluation setting.
        </p>
        <p>In our internal evaluation, we consider the performance of instruction-tuned models in the zero-shot
setting as baselines, against which we compare improvements achieved through supervised fine-tuning.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Internal Evaluation Results</title>
        <p>We evaluated both strategies, zero-shot prompting (our baselines) and supervised fine-tuning
(MedGemma-Sum-Pt), on our internal validation set of 50 gold-standard examples held out from training.
Table 1 summarizes the corresponding ROUGE and BERTScore results.</p>
        <p>As we can see from the internal evaluation results, the zero-shot models achieved reasonable
performance when guided by a well-crafted prompt, especially the Llama-3.1-8B-Instruct, which slightly
outperformed others in terms of ROUGE and BERTScore. Despite the Llama-3.1-8B-Instruct
model’s slightly better zero-shot performance, we chose to fine-tune MedGemma due to its smaller size
(4B) and native biomedical specialization. Our goal was to develop a lightweight yet efective model
that could be fine-tuned and deployed with limited computational resources. Even without fine-tuning,
MedGemma showed competitive results, and its domain adaptation made it a natural candidate for clinical
tasks.</p>
        <p>While the zero-shot models demonstrated competitive results with well-designed prompts, our
fine-tuned MedGemma-Sum-Pt model consistently outperformed all baselines across all metrics. The
improvement in ROUGE-2 and ROUGE-L scores suggests that fine-tuning helped the model better
capture both local and structural coherence of the summaries. Likewise, the increase in BERTScore F1
demonstrates an enhanced semantic alignment with the expert references.</p>
        <p>To illustrate, we selected an example from our test set in which the fine-tuned model achieved
0.43 in ROUGE-L and 0.776 in BERTScore, while the zero-shot model obtained 0.26 in ROUGE-L and
0.668 in BERTScore (as shown in Figure 3). In this case, the fine-tuned model demonstrated superior
lexical and semantic similarity to the reference summary. Qualitatively, the summary generated by the
fine-tuned model more accurately preserved essential clinical information, such as the patient’s age,
renal transplant history, absence of CMV infection prior to the episode, decreased visual acuity in the
left eye, and treatment with ganciclovir. In contrast, the zero-shot model included content not present
in the reference summary, such as bilateral eye involvement (whereas the summary refers only to the
left eye) and herpes simplex infection. In this example, the fine-tuned model produced a more concise
and focused summary that preserved the case’s key points, which explains its higher metric scores.</p>
        <p>Across other examples, common zero-shot errors included additional clinical information not present
in the reference summaries, omission of relevant lab findings, and inconsistent clinical timelines. In
contrast, the fine-tuned model showed higher fidelity to the reference data.</p>
        <p>This stage helped us identify the limitations of zero-shot summarization with compact models,
motivating the transition to a fine-tuning approach. Based on these findings, we selected the MedGemma-Sum-Pt
model, our best-performing approach, for the official submission to the MultiClinSum task.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Official Results</title>
        <p>In this section, we present the official results of our system on the shared task.</p>
        <p>Based on our internal evaluation, we selected our fine-tuned MedGemma-Sum-Pt model as our official submission to the Portuguese track of the MultiClinSum task. The model demonstrated the most consistent performance across all evaluation metrics when compared to zero-shot baselines.</p>
        <p>The submitted system was evaluated on the hidden test set by the task organizers using the official metrics: ROUGE and BERTScore. Our official results are shown in Table 2.</p>
        <p>These results show that the model maintained a strong semantic alignment with the gold summaries
(as reflected by the BERTScore), although lexical overlap (measured by ROUGE) was more modest.
This suggests frequent paraphrasing or alternative expressions, which is common in clinical language.
Since ROUGE relies on n-gram overlap, it may underestimate quality when summaries are semantically
correct but lexically diverse.</p>
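        <p>A toy check makes this concrete: two summaries that convey the same clinical fact can share almost no n-grams. A small illustration (pure Python; the sentences are invented examples, not task data, and the function is a crude stand-in for ROUGE-2):</p>

```python
def bigram_overlap(reference: str, candidate: str) -> float:
    """Fraction of reference bigrams that also appear in the candidate."""
    def bigrams(s):
        toks = s.split()
        return {tuple(toks[i:i + 2]) for i in range(len(toks) - 1)}
    ref, cand = bigrams(reference), bigrams(candidate)
    return len(ref & cand) / len(ref) if ref else 0.0


reference = "male patient aged 45 presented with fever for three days"
paraphrase = "a 45 year old man had a three day history of fever"
overlap = bigram_overlap(reference, paraphrase)
# The paraphrase preserves the meaning, yet shares no bigrams with the
# reference, so n-gram metrics score it near zero.
```

        <p>An embedding-based metric such as BERTScore would still match these token pairs by semantic similarity, which is why the two metric families diverge on abstractive outputs.</p>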
        <p>At the time of writing, comparative results from other participants have not been made publicly
available, which limits a more detailed contextualization of our system’s performance within the
competition.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Discussion</title>
        <p>Our findings suggest that even small instruction-tuned models can perform competitively in
clinical summarization tasks when fine-tuned with domain-specific data and guided by carefully crafted
prompts. The MedGemma-Sum-Pt model delivered strong semantic results with minimal infrastructure,
highlighting the feasibility of low-cost clinical NLP solutions.</p>
        <p>Our model demonstrated close alignment with expert-authored summaries in terms of meaning
and clinical relevance, as indicated by the high BERTScore. This metric captures meaning similarity
beyond exact word matches, suggesting that MedGemma-Sum-Pt was able to convey clinical information
reliably even when using different phrasing. The BERTScore results include precision, recall, and F1
metrics, which together provide a more nuanced understanding of model performance. In our results,
the close values of precision and recall suggest that MedGemma-Sum-Pt maintains a good balance
between generating accurate and comprehensive clinical summaries.</p>
        <p>In contrast, ROUGE scores were more modest, reflecting less lexical overlap. This discrepancy may
be explained by the abstractive nature of the model’s outputs, which often generate paraphrases or
reformulations that preserve meaning but diverge lexically from the reference summaries. Additionally,
the model may have learned to prioritize semantic coherence over surface-level n-gram overlap, which
aligns with the goals of clinical summarization but can penalize ROUGE scores. This behavior is
consistent with findings in the literature, such as [23], where abstractive models evaluated on clinical
texts in English showed similarly low ROUGE scores despite strong semantic fidelity as captured
by BERTScore. The persistence of this pattern in our Portuguese-language setting suggests that the
trade-off between lexical overlap and semantic alignment may be a broader characteristic of abstractive
summarization.</p>
        <p>Furthermore, these findings highlight the importance of combining lexical and semantic metrics for
evaluating clinical summarization, where preserving meaning is paramount.</p>
        <p>Our findings also underscore the central role of instruction prompts in clinical summarization with
language models. The prompt used in this work was carefully crafted to guide model behavior and
ensure consistency, particularly in zero-shot settings. While efective, this manually designed prompt
may limit generalization to other formats or clinical domains. Future work should explore systematic
prompt optimization or adaptation techniques, including prompt tuning and evaluation of diverse
prompt formulations, to reduce reliance on handcrafted instructions and improve robustness across
diverse inputs.</p>
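<p>To make the role of the handcrafted instruction concrete, a summarization prompt can be assembled as below. The wording and the helper name are hypothetical, shown only to illustrate the kind of template that prompt-tuning techniques would search over; this is not the exact prompt used in our experiments:</p>

```python
def build_prompt(clinical_case: str) -> str:
    """Assemble a hypothetical instruction prompt for Portuguese clinical
    summarization. Systematic prompt optimization would evaluate many
    variants of this instruction rather than relying on one handcrafted form."""
    # "You are a medical assistant. Summarize the following clinical case in
    #  Portuguese, preserving diagnoses, interventions, and outcomes."
    instruction = (
        "Você é um assistente médico. Resuma o caso clínico a seguir "
        "em português, preservando diagnósticos, condutas e desfechos."
    )
    return f"{instruction}\n\nCaso clínico:\n{clinical_case}\n\nResumo:"

prompt = build_prompt("Paciente de 54 anos com dor torácica...")
```

<p>Keeping the case text and the instruction in separate, clearly delimited fields is what makes such templates easy to vary systematically during prompt evaluation.</p>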
        <p>Overall, this work presents a practical approach to clinical summarization in Portuguese using
parameter-efficient fine-tuning. While our setup prioritizes accessibility and resource efficiency, further
studies comparing performance with larger models are needed to fully assess trade-offs in summarization
quality. These findings contribute to ongoing efforts toward scalable, language-specific NLP solutions
in the clinical domain.</p>
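<p>The resource argument behind parameter-efficient fine-tuning can be made concrete with a back-of-the-envelope count: LoRA-style adaptation replaces a full weight update ΔW with a low-rank product BA, so a d_out × d_in projection trains only r·(d_in + d_out) parameters. The dimensions and rank below are illustrative, not MedGemma's actual configuration:</p>

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Trainable-parameter counts: full dense delta-W vs. a rank-r
    LoRA update B @ A (A: rank x d_in, B: d_out x rank)."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora

full, lora = lora_trainable_params(2048, 2048, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")  # 64x fewer per layer
```

<p>At these illustrative sizes a single projection drops from ~4.2M to ~65K trainable parameters, which is why adapter-based fine-tuning fits on modest hardware.</p>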
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>This work presented a practical approach for clinical summarization in Portuguese by fine-tuning a
compact, biomedical-specialized model, MedGemma. We showed that even with limited computational
resources and a relatively small training dataset, it is possible to achieve competitive results, particularly
in semantic alignment with expert summaries.</p>
      <p>Our internal evaluation indicated the advantage of fine-tuning over zero-shot strategies,
highlighting the importance of domain adaptation for clinical NLP tasks. Furthermore, the combined use of
BERTScore and ROUGE metrics provided a complementary view of summary quality, capturing both
semantic fidelity and lexical overlap.</p>
      <p>Our participation in the MultiClinSum challenge underscores the potential of compact,
domain-adapted models for multilingual clinical summarization and reinforces their viability for real-world
deployment in medical settings.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Limitations</title>
      <p>We acknowledge limitations related to the size of the dataset used for training and evaluation scope.
Future work should explore larger datasets and more comprehensive external evaluations. Specifically,
we plan to incorporate the larger Portuguese clinical dataset (25,902 examples) into the training process
to assess whether a broader data foundation improves summarization quality and robustness. Moreover,
while our results show promise, the current evaluation does not include direct comparisons with
other participating systems in the MultiClinSum challenge. As official rankings and results become
available, we intend to conduct a more thorough comparative analysis to better contextualize our
model’s performance.</p>
      <p>We also observed sensitivity to prompt phrasing during generation, highlighting the need for
systematic prompt optimization strategies or prompt tuning techniques to reduce variability in output
quality. This issue was particularly evident in our zero-shot experiments.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was supported by the Brazilian National Council for Scientific and Technological
Development (CNPq) under Project 441610/2023-4, and by the Natural Sciences and Engineering Research
Council of Canada (NSERC) Discovery Grants program (RGPIN-2021-04130).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT to assist with grammar and spelling
checks. After using this service, the author(s) reviewed and edited the content as needed and take(s)
full responsibility for the publication’s content.</p>
      <p>To support further research in clinical natural language processing, particularly within
Portuguese-language healthcare contexts, our MedGemma-Sum-Pt model is publicly available on Hugging Face at
https://huggingface.co/pucpr-br/medgemma-pt-finetuned-multiclinsum, released under a research-only
license.</p>
      <p>As it was built upon the original MedGemma model, its use is restricted to academic and
non-commercial purposes, in accordance with the original licensing constraints.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L. E. S. e.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <surname>C. M. C. M. Barra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <article-title>Assembling natural language processing resources to perform the summarization of clinical narratives</article-title>
          ,
          <source>Doctoral thesis</source>
          ,
          <source>Pontifícia Universidade Católica do Paraná, Curitiba</source>
          ,
          <year>2020</year>
          . URL: https://archivum.grupomarista.org.br/pergamumweb/vinculos/000094/0000943a.pdf, accessed: 21 Dec. 2020.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E. T. R.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. V. A.</given-names>
            <surname>de Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Knafou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E. S. e.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Copara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. B.</given-names>
            <surname>Gumiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. F. A. d.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Paraiso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Teodoro</surname>
          </string-name>
          ,
          <string-name>
            <surname>C. M. C. M. Barra</surname>
          </string-name>
          ,
          <article-title>BioBERTpt - a Portuguese neural language model for clinical named entity recognition</article-title>
          ,
          <source>in: Proceedings of the 3rd Clinical Natural Language Processing Workshop</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          . URL: https://www.aclweb.org/anthology/2020.clinicalnlp-1.7.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodríguez-Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lima-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Escolano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pratesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vigil-Gimenez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Farré-Maduell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <article-title>Overview of MultiClinSum task at BioASQ 2025: evaluation of clinical case summarization strategies for multiple languages: data, evaluation, resources and results</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <article-title>Overview of BioASQ 2025: The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in:
          <string-name>
            <surname>J. Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Gonzalo</surname>
          </string-name>
          , et al. (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>