<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>O. d. Souza, H. R. Tabosa, D. M. d. Oliveira, M. H. d. S. Oliveira, Um método de sumarização automática de textos através de dados estatísticos e processamento de linguagem natural, Informação &amp; Sociedade: Estudos</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1946-1836</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/CBMS52027.2021.00056</article-id>
      <title-group>
        <article-title>MedGemma-Sum-Pt: A Lightweight Model for Portuguese Clinical Summarization⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elisa Terumi Rubel Schneider</string-name>
          <email>elisa.rubel@pucpr.edu.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Henrique Schneider</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emerson Cabrera Paraiso</string-name>
          <email>paraiso@ppgia.pucpr.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alceu Souza Britto Jr</string-name>
          <email>alceu@ppgia.pucpr.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael Menelau Oliveira Cruz</string-name>
          <email>Rafael.Menelau-Cruz@etsmtl.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>École de Technologie Supérieure (ÉTS), University of Quebec</institution>
          ,
          <addr-line>1100 Notre-Dame Street West, Montreal</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Pontifícia Universidade Católica do Paraná (PUCPR)</institution>
          ,
          <addr-line>Rua Imaculada Conceição, 1155, Curitiba</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>27</volume>
      <issue>2017</issue>
      <fpage>474</fpage>
      <lpage>479</lpage>
      <abstract>
        <p>Automatic summarization of clinical case reports is a challenging yet crucial task to support healthcare professionals in rapidly extracting relevant patient information. The scarcity of large-scale domain-specific datasets and pretrained models, combined with the linguistic complexity of medical texts, makes clinical summarization particularly challenging for low-resource languages like Portuguese and motivates the need for efficient adaptation strategies. This paper presents the approach developed by the ÉTS-PUCPR team for the Portuguese subtask of MultiClinSum, a multilingual clinical summarization shared task. Our methodology explores (i) zero-shot prompting with general-purpose instruction-following models, and (ii) supervised fine-tuning using LoRA, a parameter-efficient fine-tuning technique, on the biomedical language model MedGemma. We compare these strategies to assess their effectiveness for clinical summarization in Portuguese. Our results demonstrated competitive performance in internal evaluations, particularly when compared to zero-shot baseline performances, showing strong semantic similarity with expert summaries as measured by BERTScore. Despite limitations such as a relatively small training dataset, our findings highlight the potential of fine-tuning domain-specific models under resource constraints for low-resource clinical summarization tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Large language model</kwd>
        <kwd>NLP</kwd>
        <kwd>summarization</kwd>
        <kwd>clinical cases</kwd>
        <kwd>fine-tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The increasing availability of multilingual clinical case reports opens new opportunities for developing
automatic summarization systems that can assist healthcare professionals in efficiently accessing critical
patient information. However, clinical summarization remains a challenging task. Medical texts are
complex, sensitive, and often written in highly variable formats. These challenges are amplified in
languages with fewer resources, such as Portuguese, where annotated datasets and domain-adapted
models are still scarce [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] .
      </p>
      <p>
        The MultiClinSum task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], part of the BioASQ Lab at CLEF 2025 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], addresses these challenges by
evaluating systems for automatic summarization of clinical case reports in multiple languages. The
task encourages the development of models capable of generating concise, coherent, and medically
meaningful summaries across linguistic contexts.
      </p>
      <p>In this paper, we present the approach developed by the ÉTS-PUCPR team for the Portuguese subtask
of the MultiClinSum challenge. We explored two strategies: (i) zero-shot prompting using
general-purpose instruction-following models, and (ii) supervised fine-tuning on the clinical summarization data
released by the challenge organizers. For the latter, we adopt a parameter-efficient fine-tuning approach
using LoRA on top of MedGemma [5], a multilingual biomedical language model. All experiments are
conducted in a resource-constrained environment to reflect realistic deployment conditions in clinical
or academic settings.</p>
      <p>
        This work introduces MedGemma-Sum-Pt, a domain-adapted model for summarization in a
low-resource language. Our results align with previous studies showing that fine-tuning on in-domain data,
such as clinical narratives in Portuguese, tends to outperform general-purpose models due to linguistic
complexity and domain-specific terminology [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [6] [7].
      </p>
      <p>To foster further research, we release our fine-tuned model, addressing the current gap of
summarization models for clinical Portuguese:
https://huggingface.co/pucpr-br/medgemma-pt-finetunedmulticlinsum.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <p>
        The MultiClinSum task, introduced as part of the BioASQ Lab at CLEF 2025, aims to evaluate automatic
summarization systems applied to multilingual clinical case reports. Participants are required to generate
concise summaries from real-world clinical narratives written in different languages, including English,
Spanish, French, and Portuguese [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The task is framed as a summarization problem, where systems must produce short, fluent, and
medically accurate summaries that capture the essential information from the original clinical text.
Each instance consists of a clinical case report as input and a reference summary manually written by
domain experts.</p>
      <p>The dataset released for the MultiClinSum task, covering all languages, is divided into three subsets:</p>
      <p>• A gold-standard set, comprising 592 clinical case reports with summaries manually written and reviewed by medical experts, which serve as high-quality ground truth for training and evaluation;
• A large-scale set, containing 25,902 clinical cases also accompanied by summaries. However, unlike the gold set, these summaries are not expert-verified and may vary in quality;
• A test set with 3,442 clinical cases, used for the official evaluation of system performance. The summaries in this set are withheld from participants to ensure unbiased assessment.</p>
      <p>Evaluation is conducted using a combination of automatic metrics, such as ROUGE and BERTScore.</p>
      <p>ROUGE [8] measures the lexical overlap between the generated and reference texts, providing an estimate of informativeness based on shared n-grams. It is defined as follows:</p>
      <p>\mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \{\mathrm{ReferenceSummaries}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n)}{\sum_{S \in \{\mathrm{ReferenceSummaries}\}} \sum_{\mathrm{gram}_n \in S} \mathrm{Count}(\mathrm{gram}_n)} \quad (1)</p>
      <p>where n is the length of the n-gram, \mathrm{Count}_{\mathrm{match}}(\mathrm{gram}_n) is the maximum number of n-grams co-occurring in both the candidate summary and the reference summaries, and \mathrm{Count}(\mathrm{gram}_n) is the total number of n-grams in the reference summaries.</p>
      <p>BERTScore [9] evaluates semantic similarity by comparing contextualized token embeddings from a pre-trained transformer model. Given a candidate summary \hat{x} = \{\hat{x}_1, \ldots, \hat{x}_k\} and a reference summary x = \{x_1, \ldots, x_m\}, we obtain corresponding normalized embeddings \{\hat{e}_j\} and \{e_i\}, respectively. BERTScore computes token-level cosine similarity using the inner product of embeddings, and derives precision, recall, and F1 as follows:</p>
      <p>P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} e_i^{\top} \hat{e}_j \quad (2)</p>
      <p>R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} e_i^{\top} \hat{e}_j \quad (3)</p>
      <p>\mathrm{BERTScore}_{F1} = \frac{2 \times P_{\mathrm{BERT}} \times R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}} \quad (4)</p>
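      <p>Equations (1)–(4) can be illustrated with a minimal sketch. This is not the official evaluation code: the whitespace tokenizer and the identity embeddings in the test are illustrative stand-ins for a real tokenizer and a transformer encoder.</p>

```python
from collections import Counter

import numpy as np


def ngrams(tokens, n):
    """All n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(candidate, references, n=1):
    """ROUGE-N as in Eq. (1): clipped n-gram matches over reference n-grams."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    match, total = 0, 0
    for ref in references:
        ref_counts = Counter(ngrams(ref.split(), n))
        # Count_match: n-gram co-occurrences, clipped by the candidate counts
        match += sum(min(cnt, cand_counts[g]) for g, cnt in ref_counts.items())
        total += sum(ref_counts.values())
    return match / total if total else 0.0


def bertscore_f1(cand_emb, ref_emb):
    """Eqs. (2)-(4) on pre-normalized token embeddings (one unit-norm row
    per token); greedy max-similarity matching in both directions."""
    sim = ref_emb @ cand_emb.T          # cosine similarities e_i^T e_hat_j
    p = sim.max(axis=0).mean()          # precision: best match per candidate token
    r = sim.max(axis=1).mean()          # recall: best match per reference token
    return 2 * p * r / (p + r)
```

      <p>For example, rouge_n("o paciente apresentou febre", ["o paciente apresentou febre alta"], n=1) yields 4/5 = 0.8, since four of the five reference unigrams also occur in the candidate.</p>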
      <p>The multilingual nature of the task also encourages research on cross-lingual and language-specific
modeling approaches in the clinical domain.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Related Work</title>
      <p>
        Few studies have explored LLM-based summarization of clinical texts in Portuguese, a low-resource
language that presents several unique challenges. The scarcity of publicly available clinical corpora,
complex morphology, frequent use of jargon, acronyms, and institution-specific abbreviations, as well
as dialectal variation, all hinder the development of robust NLP tools for tasks such as summarization
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [10]. Despite the strong performance of large language models in condensing lengthy medical
texts while preserving clinically relevant information, their effectiveness in Portuguese clinical texts
remains largely underexplored.
      </p>
      <p>Moreover, Portuguese exhibits rich morphology (e.g., gender and number inflections, dropped
pronouns), requiring careful natural language generation. Regional variations between European and
Brazilian Portuguese further affect vocabulary and syntax. Despite these challenges, initial efforts
have shown that specialized models for clinical Portuguese achieve competitive results, indicating the
potential of AI solutions for clinical NLP in Portuguese.</p>
      <p>A study [6] investigated abstractive summarization for Brazilian Portuguese using deep learning-based approaches. While their results were promising, the study highlighted ongoing challenges related
to coherence, grammar, and fluency. Although their work constitutes a preliminary exploration of
abstractive summarization in Brazilian Portuguese using neural models, it did not address
domain-specific applications such as clinical narratives. Other studies have also highlighted summarization
research for Portuguese texts, such as [11] and [12], though these works focus on general-domain texts
rather than clinical narratives.</p>
      <p>
        In the clinical domain, a study on summarization for Brazilian Portuguese [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] explored different
approaches applied to electronic health records of chronic disease patients, including an unsupervised
neural model based on fine-tuned BERT, as well as supervised methods using sequence labeling and
dictionary-based techniques. To reduce redundancy in the generated summaries, a semantic similarity
method based on Siamese Neural Networks was employed. Results showed that supervised methods
achieved better performance, particularly in preserving clinically relevant information, highlighting
the importance of domain-specific resources for effective summarization in Portuguese.
      </p>
      <p>Another study on automatic text summarization in Portuguese [13] compared six algorithms,
including classical methods (e.g., Luhn), modern neural models (ChatGPT), and a custom-designed approach
called the Marques algorithm. The evaluation, conducted on a COVID-19-related document, revealed
that the Marques algorithm outperformed others in precision, coherence, cohesion, and processing time.
Although not focused on clinical data, this study highlights important considerations for
summarization in Portuguese, such as the benefits of domain-adapted models, the importance of comprehensive
evaluation (including human assessment), and the challenges of adapting summarization strategies to
different text types.</p>
      <p>
        Although several clinical domain language models have been trained in Portuguese, such as
MEDLLM-BR [14], BioBERTpt [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], CardioBERTpt [10], gpt2-bio-pt [7], and DepreBERTBR [15], our
investigation did not identify any large language model specifically designed or fine-tuned for the task of
clinical summarization in Portuguese.
      </p>
      <p>In this work, we address this gap by fine-tuning MedGemma with LoRA on Portuguese clinical
summarization data, demonstrating improved performance in generating semantically aligned summaries
under realistic resource constraints. Our approach enables efficient domain adaptation and contributes
to the community with a publicly available fine-tuned model for Portuguese clinical summarization.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>To address the Portuguese summarization subtask in MultiClinSum, we experimented with two
approaches:</p>
      <p>(1) Zero-shot prompting, where the model receives only the input text and an instruction, without
seeing any examples; and</p>
      <p>(2) Supervised fine-tuning, where the model is explicitly trained on a labeled dataset to optimize its
summarization performance.</p>
      <p>These strategies are illustrated in Figure 1 and form the basis of the experiments presented in this work.</p>
      <sec id="sec-4-1">
        <title>4.1. Zero-Shot Prompting</title>
        <p>Our initial experiments focused on evaluating the performance of various open-source language models
in a zero-shot setting, to assess their ability to summarize clinical case reports in Portuguese without
additional training.</p>
        <p>We selected five instruction-following models for this stage, all of which were accessed via the Unsloth repository [16] with 4-bit quantization to enable efficient inference on limited hardware. All selected models are multilingual and capable of understanding and generating text in Portuguese.</p>
        <p>• Gemma-7B-Instruct [17]: A general-purpose instruction-tuned version of the Gemma 7B model, optimized for dialogue and reasoning tasks.
• Llama-3.1-8B-Instruct [18]: An instruction-tuned variant of LLaMA 3.1 with 8 billion parameters, offering high performance on a wide range of reasoning tasks.
• Llama-3.2-3B-Instruct [18]: A lightweight version of LLaMA 3.2 (3B), chosen for its small footprint and fast inference capabilities.
• Medgemma-4B-Instruct [5]: A smaller, biomedical-focused variant of Gemma, tuned on clinical and health-related instructions.
• Qwen2.5-7B-Instruct [19]: A 7-billion-parameter instruction-tuned model based on the Qwen 2.5 architecture, designed for general-purpose reasoning and generation tasks with efficient performance.</p>
        <p>Few-shot prompting was initially considered but was discarded due to input truncation caused by the limited context window of the models.</p>
        <p>These models were selected based on a balance between size, instruction-following capabilities, and
accessibility. Our goal was to prioritize lightweight models (3B to 8B parameters) that could be easily
deployed on modest hardware, such as edge devices or memory-constrained servers.</p>
        <p>Small and mid-sized models typically demonstrate lower performance compared to large-scale
models across a wide range of evaluation metrics and benchmarks. In a recent comparison involving
30 benchmark tasks, GPT-4 outperformed lightweight models (with up to 8B parameters) in 26 cases,
based on metrics such as ROUGE, accuracy, and task-specific scores [20].</p>
        <p>While larger models might offer superior summarization quality, our focus was on finding efficient and
practical solutions suitable for researchers and users with limited computational resources, particularly
in clinical or academic settings where access to large-scale infrastructure may be constrained.</p>
        <p>Smaller models enable real-time summarization directly within healthcare facilities or on portable
devices, facilitating faster decision-making and improving patient care without requiring expensive or
complex infrastructure.</p>
        <p>We used the FastLanguageModel.from_pretrained interface from Unsloth to load each model with 4-bit quantization. The models were configured with a context window of 8192 tokens and a generation limit of 512 new tokens. This setup allowed efficient testing without requiring access to high-end GPUs.</p>
        <p>4.1.1. Prompt Design</p>
        <p>All models were tested using the same prompt, designed to reflect the structure and content expected in a professional clinical summary.</p>
        <p>Given the central role of instruction prompts, especially in zero-shot scenarios with general-purpose
models, we paid particular attention to the design and refinement of our prompt for this task. In the
absence of in-context examples, the prompt alone defines the structure, tone, and granularity of the
expected output.</p>
        <p>During early experiments, we tested multiple prompt formulations to evaluate their effect on model
behavior. Minimal or short prompts (e.g., “Summarize the following clinical case”) often produced
generic outputs lacking key clinical details. Moreover, without an explicit instruction to reduce the
input length, the models tended to generate overly long summaries. The requirement of a 75% reduction
in text length helped constrain verbosity and improved focus on relevant findings.</p>
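        <p>The length directives discussed above (a reduction of at least 75%, 5 to 10 sentences, 50 to 150 words) can be verified mechanically. A minimal sketch with a naive whitespace/punctuation tokenizer (an illustrative helper, not part of the official pipeline):</p>

```python
import re


def meets_constraints(source: str, summary: str) -> bool:
    """Check the prompt's length directives: at least a 75% reduction in
    word count, 5-10 sentences, and 50-150 words in the summary."""
    src_words = len(source.split())
    sum_words = len(summary.split())
    # Naive sentence split on ., ! or ? followed by optional whitespace
    sentences = [s for s in re.split(r"[.!?]+\s*", summary.strip()) if s]
    reduction = 1 - sum_words / src_words if src_words else 0.0
    return (reduction >= 0.75
            and 5 <= len(sentences) <= 10
            and 50 <= sum_words <= 150)
```

        <p>Such a check could be used to flag generations that ignore the length directives, one of the failure modes we observed with minimal prompts.</p>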
        <p>The selected prompt struck a balance between guidance and conciseness. It proved robust across different model architectures and sizes (e.g., 3B to 8B parameters), and consistently improved informativeness and coherence compared to shorter or less explicit alternatives. To emulate expert-authored case summaries, we also encouraged a chronologically coherent narrative that preserves the logical flow of clinical reasoning. We observed that models were particularly sensitive to the presence (or absence) of directives regarding summary length and content types.</p>
        <p>The prompt was written in Portuguese to ensure alignment with the language of the input cases:
Você é um especialista em medicina clínica. Leia o caso clínico e produza um resumo clínico técnico, objetivo,
conciso, claro e profissional com as seguintes características:
- Faça uma redução de pelo menos 75% do conteúdo original, mantendo a essência do caso. O texto gerado deve
ter entre 5 a 10 frases e entre 50 a 150 palavras.
- Identifique dados principais do paciente (idade, sexo, antecedentes importantes).
- Descreva os sintomas e sinais apresentados, com tempo de evolução.
- Resuma os exames relevantes e os achados principais.
- Aponte as hipóteses diagnósticas e diagnóstico final, se houver.
- Descreva a conduta realizada e a evolução do paciente.
- Escreva um parágrafo coeso e conciso contendo essas informações principais. Não escreva em tópicos, mas
em texto fluido.</p>
        <p>Escreva com clareza, precisão e fluidez técnica, como um profissional da área da saúde humana, em um texto
descritivo com narrativa cronológica simples.</p>
        <p>Siga as instruções e forneça o resumo no final.</p>
        <p>The English translation of the prompt is provided below for clarity and accessibility.</p>
        <p>You are a specialist in clinical medicine. Read the clinical case and produce a technical, objective, concise, clear,
and professional clinical summary with the following characteristics:
- Reduce the original content by at least 75%, keeping the essence of the case. The generated text should contain
between 5 to 10 sentences and between 50 to 150 words.
- Identify key patient data (age, sex, relevant medical history).
- Describe symptoms and signs presented, including time of onset/evolution.
- Summarize relevant exams and main findings.
- Indicate diagnostic hypotheses and final diagnosis, if available.
- Describe the treatment and the patient’s clinical outcome.
- Write a cohesive and concise paragraph containing this core information. Do not use bullet points, but rather
a fluid narrative.</p>
        <p>Write clearly, precisely, and with technical fluency, as a healthcare professional would, using a simple
chronological structure.</p>
        <p>Follow the instructions and provide the summary at the end.</p>
        <p>The final prompt explicitly instructs the model to produce concise, fluent, and structured summaries
emulating clinical documentation style. It specifies essential elements to include, such as patient
demographics, symptom evolution, diagnostic reasoning, relevant exams, and treatment outcomes. It
also imposes constraints on summary length and discourages bulleted or disjointed text. This design
was guided by two main goals: (i) ensuring the inclusion of clinically relevant information; and (ii)
enforcing a professional tone and format aligned with medical communication standards.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Supervised Fine-Tuning</title>
        <p>Building on insights from zero-shot evaluation, we fine-tuned MedGemma [5], a domain-adapted
language model specialized in clinical and biomedical text generation, to develop a new model named
MedGemma-Sum-Pt.</p>
        <p>The model was fine-tuned to better adapt to the task based on the provided training data and a fixed
prompt format. We used the same instruction prompt as in the zero-shot experiment, to ensure that the
model internalizes the expected output structure, reducing the risk of format mismatch or unpredictable
behavior during summarization. The fine-tuning was performed exclusively on the gold-standard
dataset provided by the MultiClinSum organizers, containing high-quality expert summaries essential
for training accurate clinical summarization models.</p>
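        <p>Each training instance pairs the fixed instruction prompt and a clinical case with its expert summary. A minimal sketch of how such records might be assembled into the single text field commonly consumed by SFT trainers (the section markers and EOS token are illustrative assumptions, not our exact format):</p>

```python
def build_sft_record(instruction: str, case_text: str, summary: str,
                     eos_token: str = "</s>") -> dict:
    """Concatenate instruction, input case, and target summary into one
    'text' field; 'prompt' alone is what the model sees at inference."""
    prompt = f"{instruction}\n\n### Caso clínico:\n{case_text}\n\n### Resumo:\n"
    return {"prompt": prompt, "text": prompt + summary + eos_token}


record = build_sft_record(
    "Você é um especialista em medicina clínica. Produza um resumo...",
    "Paciente de 45 anos, sexo masculino, com febre há 3 dias...",
    "Homem de 45 anos com quadro febril de 3 dias de evolução...",
)
```

        <p>Keeping the fine-tuning prompt identical to the zero-shot prompt is what lets the model internalize the expected output structure.</p>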
        <p>Our training strategy utilized a parameter-efficient fine-tuning (PEFT) technique, LoRA (Low-Rank Adaptation) [21], to efficiently update a subset of model parameters. Unlike traditional fine-tuning, which updates all model weights and requires substantial computational resources, LoRA introduces trainable low-rank matrices into specific layers (e.g., attention and language layers) while keeping the original model weights frozen. This significantly reduces the number of trainable parameters and accelerates training. As illustrated in Figure 2, this approach enables a compact and efficient adaptation process.</p>
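        <p>The idea can be sketched numerically: for a frozen weight matrix W of shape d×d, LoRA learns two small factors B (d×r) and A (r×d) so that the effective weight is W + (α/r)·BA, training 2·d·r parameters instead of d². A toy illustration with numpy (the shapes, zero-initialized B, and α/r scaling follow the LoRA paper; d, r, and α here are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 4096, 8, 16           # hidden size, LoRA rank, scaling factor

W = rng.normal(size=(d, d))         # frozen pretrained weight (never updated)
A = rng.normal(size=(r, d)) * 0.01  # trainable low-rank factor
B = np.zeros((d, r))                # B starts at zero, so the update starts at 0


def effective_weight(W, B, A, alpha, r):
    """LoRA forward weight: W + (alpha / r) * B @ A."""
    return W + (alpha / r) * B @ A


full_params = d * d                 # parameters updated by full fine-tuning
lora_params = d * r + r * d         # parameters updated by LoRA
# With d=4096 and r=8: 16,777,216 vs 65,536 trainable parameters per matrix
```

        <p>Since B is initialized to zero, training begins exactly at the pretrained weights and gradually learns a low-rank correction.</p>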
        <p>We configured LoRA with a rank of r = 8, which balances parameter efficiency and expressiveness, as evidenced by [21]. Adaptation modules were injected into the query and value projection matrices within the attention layers of the transformer, which have been shown to be particularly effective for fine-tuning language models on downstream NLP tasks. Since the task is purely textual, all vision-related components of the architecture were disabled. Additionally, 4-bit quantization was employed to optimize memory usage during training, allowing training on modest hardware without significantly degrading model accuracy.</p>
        <p>Training was conducted using the SFTTrainer framework [22], with the following hyperparameters:
a per-device batch size of 4 with gradient accumulation over 4 steps, a learning rate of 2e-4, and
a total of 3 training epochs (375 optimization steps). To reduce memory consumption, we used the
adamw_8bit optimizer, combined with a linear learning rate scheduler and a brief warm-up phase of 5
steps. The entire training process took approximately 6.4 hours on a single NVIDIA T4 GPU (16 GB),
with a peak memory usage of around 11 GB. This demonstrates the feasibility of fine-tuning using
affordable, widely accessible hardware rather than relying on high-end infrastructure.</p>
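        <p>The hyperparameters reported above can be gathered into a trainer configuration. A hedged sketch as a plain dict whose field names mirror Hugging Face TrainingArguments conventions (the exact argument object we passed to SFTTrainer may differ):</p>

```python
# Hyperparameters reported in the text, as a plain configuration dict.
training_config = {
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "num_train_epochs": 3,
    "optim": "adamw_8bit",          # 8-bit AdamW to reduce optimizer memory
    "lr_scheduler_type": "linear",
    "warmup_steps": 5,
}

# Gradients are accumulated over 4 steps, so each optimizer update
# effectively sees 4 x 4 = 16 training examples.
effective_batch_size = (training_config["per_device_train_batch_size"]
                        * training_config["gradient_accumulation_steps"])
```
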
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Strategy</title>
        <p>To assess the quality of summaries produced by both zero-shot prompting and fine-tuned generation,
we set aside the first 50 examples from our 592-report gold-standard training set. These 50 cases were
excluded from fine-tuning and used solely as an internal validation set for comparative evaluation.</p>
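        <p>The held-out split described above is deterministic and can be sketched in a few lines (reports stands in for the list of (case, summary) pairs; the index-based slicing follows the "first 50 examples" description):</p>

```python
def split_gold(reports, n_val=50):
    """Hold out the first n_val examples as an internal validation set;
    the remainder is used for fine-tuning."""
    return reports[n_val:], reports[:n_val]


# With the 592 gold-standard reports, this yields 542 training examples
# and 50 validation examples.
reports = [(f"case-{i}", f"summary-{i}") for i in range(592)]
train, val = split_gold(reports)
```
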
        <p>
          We adopted the automatic evaluation metrics proposed by the MultiClinSum task organizers: ROUGE
and BERTScore, which provide surface-level and semantic similarity measurements, respectively. For
BERTScore, we used the biobertpt-all model [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], a Portuguese clinical and biomedical BERT model,
ensuring domain and language alignment with the evaluation setting.
        </p>
        <p>In our internal evaluation, we consider the performance of instruction-tuned models in the zero-shot
setting as baselines, against which we compare improvements achieved through supervised fine-tuning.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Internal Evaluation Results</title>
        <p>We evaluated both strategies, zero-shot prompting (our baselines) and supervised fine-tuning
(MedGemma-Sum-Pt), on our internal validation set of 50 gold-standard examples held out from training.
Table 1 summarizes the corresponding ROUGE and BERTScore results.</p>
        <p>As we can see from the internal evaluation results, the zero-shot models achieved reasonable
performance when guided by a well-crafted prompt, especially the Llama-3.1-8B-Instruct, which slightly
outperformed others in terms of ROUGE and BERTScore. Despite the Llama-3.1-8B-Instruct
model’s slightly better zero-shot performance, we chose to fine-tune MedGemma due to its smaller size
(4B) and native biomedical specialization. Our goal was to develop a lightweight yet efective model
that could be fine-tuned and deployed with limited computational resources. Even without fine-tuning,
MedGemma showed competitive results, and its domain adaptation made it a natural candidate for clinical
tasks.</p>
        <p>While the zero-shot models demonstrated competitive results with well-designed prompts, our
fine-tuned MedGemma-Sum-Pt model consistently outperformed all baselines across all metrics. The
improvement in ROUGE-2 and ROUGE-L scores suggests that fine-tuning helped the model better
capture both local and structural coherence of the summaries. Likewise, the increase in BERTScore F1
demonstrates an enhanced semantic alignment with the expert references.</p>
        <p>To illustrate, we selected an example from our test set in which the fine-tuned model achieved
0.43 in ROUGE-L and 0.776 in BERTScore, while the zero-shot model obtained 0.26 in ROUGE-L and
0.668 in BERTScore (as shown in Figure 3). In this case, the fine-tuned model demonstrated superior
lexical and semantic similarity to the reference summary. Qualitatively, the summary generated by the
fine-tuned model more accurately preserved essential clinical information, such as the patient’s age,
renal transplant history, absence of CMV infection prior to the episode, decreased visual acuity in the
left eye, and treatment with ganciclovir. In contrast, the zero-shot model included content not present
in the reference summary, such as bilateral eye involvement (whereas the summary refers only to the
left eye) and herpes simplex infection. In this example, the fine-tuned model produced a more concise
and focused summary that preserved the case’s key points, which explains its higher metric scores.</p>
        <p>Across other examples, common zero-shot errors included additional clinical information not present
in the reference summaries, omission of relevant lab findings, and inconsistent clinical timelines. In
contrast, the fine-tuned model showed higher fidelity to the reference data.</p>
        <p>This stage helped us identify the limitations of zero-shot summarization with compact models,
motivating the transition to a fine-tuning approach. Based on these findings, we selected the MedGemma-Sum-Pt
model, our best-performing approach, for the official submission to the MultiClinSum task.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Official Results</title>
        <p>In this section, we present the official results of our system on the shared task.</p>
        <p>Based on our internal evaluation, we selected our fine-tuned MedGemma-Sum-Pt model as our official submission to the Portuguese track of the MultiClinSum task. The model demonstrated the most consistent performance across all evaluation metrics when compared to zero-shot baselines.</p>
        <p>The submitted system was evaluated on the hidden test set by the task organizers using the official metrics: ROUGE and BERTScore. Our official results are shown in Table 2.</p>
        <p>These results show that the model maintained a strong semantic alignment with the gold summaries
(as reflected by the BERTScore), although lexical overlap (measured by ROUGE) was more modest.
This suggests frequent paraphrasing or alternative expressions, which is common in clinical language.
Since ROUGE relies on n-gram overlap, it may underestimate quality when summaries are semantically
correct but lexically diverse.</p>
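        <p>A toy check makes this concrete: two summaries that convey the same clinical fact can share almost no n-grams. A small illustration (pure Python; the sentences are invented examples, not task data, and the function is a crude stand-in for ROUGE-2):</p>

```python
def bigram_overlap(reference: str, candidate: str) -> float:
    """Fraction of reference bigrams that also appear in the candidate."""
    def bigrams(s):
        toks = s.split()
        return {tuple(toks[i:i + 2]) for i in range(len(toks) - 1)}
    ref, cand = bigrams(reference), bigrams(candidate)
    return len(ref & cand) / len(ref) if ref else 0.0


reference = "male patient aged 45 presented with fever for three days"
paraphrase = "a 45 year old man had a three day history of fever"
overlap = bigram_overlap(reference, paraphrase)
# The paraphrase preserves the meaning, yet shares no bigrams with the
# reference, so n-gram metrics score it near zero.
```

        <p>An embedding-based metric such as BERTScore would still match these token pairs by semantic similarity, which is why the two metric families diverge on abstractive outputs.</p>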
        <p>At the time of writing, comparative results from other participants have not been made publicly
available, which limits a more detailed contextualization of our system’s performance within the
competition.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Discussion</title>
        <p>Our findings suggest that even small instruction-tuned models can perform competitively in
clinical summarization tasks when fine-tuned with domain-specific data and guided by carefully crafted
prompts. The MedGemma-Sum-Pt model delivered strong semantic results with minimal infrastructure,
highlighting the feasibility of low-cost clinical NLP solutions.</p>
        <p>Our model demonstrated close alignment with expert-authored summaries in terms of meaning
and clinical relevance, as indicated by the high BERTScore. This metric captures meaning similarity
beyond exact word matches, suggesting that MedGemma-Sum-Pt was able to convey clinical information
reliably even when using different phrasing. The BERTScore results include precision, recall, and F1
metrics, which together provide a more nuanced understanding of model performance. In our results,
the close values of precision and recall suggest that MedGemma-Sum-Pt maintains a good balance
between generating accurate and comprehensive clinical summaries.</p>
        <p>In contrast, ROUGE scores were more modest, reflecting less lexical overlap. This discrepancy may
be explained by the abstractive nature of the model’s outputs, which often generate paraphrases or
reformulations that preserve meaning but diverge lexically from the reference summaries. Additionally,
the model may have learned to prioritize semantic coherence over surface-level n-gram overlap, which
aligns with the goals of clinical summarization but can penalize ROUGE scores. This behavior is
consistent with findings in the literature, such as [23], where abstractive models evaluated on clinical
texts in English showed similarly low ROUGE scores despite strong semantic fidelity as captured
by BERTScore. The persistence of this pattern in our Portuguese-language setting suggests that the
trade-off between lexical overlap and semantic alignment may be a broader characteristic of abstractive
summarization.</p>
        <p>Furthermore, these findings highlight the importance of combining lexical and semantic metrics for
evaluating clinical summarization, where preserving meaning is paramount.</p>
        <p>Our findings also underscore the central role of instruction prompts in clinical summarization with
language models. The prompt used in this work was carefully crafted to guide model behavior and
ensure consistency, particularly in zero-shot settings. While efective, this manually designed prompt
may limit generalization to other formats or clinical domains. Future work should explore systematic
prompt optimization or adaptation techniques, including prompt tuning and evaluation of diverse
prompt formulations, to reduce reliance on handcrafted instructions and improve robustness across
diverse inputs.</p>
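<p>To make the role of the handcrafted instruction concrete, a summarization prompt can be assembled as below. The wording and the helper name are hypothetical, shown only to illustrate the kind of template that prompt-tuning techniques would search over; this is not the exact prompt used in our experiments:</p>

```python
def build_prompt(clinical_case: str) -> str:
    """Assemble a hypothetical instruction prompt for Portuguese clinical
    summarization. Systematic prompt optimization would evaluate many
    variants of this instruction rather than relying on one handcrafted form."""
    # "You are a medical assistant. Summarize the following clinical case in
    #  Portuguese, preserving diagnoses, interventions, and outcomes."
    instruction = (
        "Você é um assistente médico. Resuma o caso clínico a seguir "
        "em português, preservando diagnósticos, condutas e desfechos."
    )
    return f"{instruction}\n\nCaso clínico:\n{clinical_case}\n\nResumo:"

prompt = build_prompt("Paciente de 54 anos com dor torácica...")
```

<p>Keeping the case text and the instruction in separate, clearly delimited fields is what makes such templates easy to vary systematically during prompt evaluation.</p>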
        <p>Overall, this work presents a practical approach to clinical summarization in Portuguese using
parameter-efficient fine-tuning. While our setup prioritizes accessibility and resource efficiency, further
studies comparing performance with larger models are needed to fully assess trade-offs in summarization
quality. These findings contribute to ongoing efforts toward scalable, language-specific NLP solutions
in the clinical domain.</p>
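<p>The resource argument behind parameter-efficient fine-tuning can be made concrete with a back-of-the-envelope count: LoRA-style adaptation replaces a full weight update ΔW with a low-rank product BA, so a d_out × d_in projection trains only r·(d_in + d_out) parameters. The dimensions and rank below are illustrative, not MedGemma's actual configuration:</p>

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Trainable-parameter counts: full dense delta-W vs. a rank-r
    LoRA update B @ A (A: rank x d_in, B: d_out x rank)."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)
    return full, lora

full, lora = lora_trainable_params(2048, 2048, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")  # 64x fewer per layer
```

<p>At these illustrative sizes a single projection drops from ~4.2M to ~65K trainable parameters, which is why adapter-based fine-tuning fits on modest hardware.</p>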
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>This work presented a practical approach for clinical summarization in Portuguese by fine-tuning a
compact, biomedical-specialized model, MedGemma. We showed that even with limited computational
resources and a relatively small training dataset, it is possible to achieve competitive results, particularly
in semantic alignment with expert summaries.</p>
      <p>Our internal evaluation indicated the advantage of fine-tuning over zero-shot strategies,
highlighting the importance of domain adaptation for clinical NLP tasks. Furthermore, the combined use of
BERTScore and ROUGE metrics provided a complementary view of summary quality, capturing both
semantic fidelity and lexical overlap.</p>
      <p>Our participation in the MultiClinSum challenge underscores the potential of compact,
domain-adapted models for multilingual clinical summarization and reinforces their viability for real-world
deployment in medical settings.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Limitations</title>
      <p>We acknowledge limitations related to the size of the dataset used for training and evaluation scope.
Future work should explore larger datasets and more comprehensive external evaluations. Specifically,
we plan to incorporate the larger Portuguese clinical dataset (25,902 examples) into the training process
to assess whether a broader data foundation improves summarization quality and robustness. Moreover,
while our results show promise, the current evaluation does not include direct comparisons with
other participating systems in the MultiClinSum challenge. As official rankings and results become
available, we intend to conduct a more thorough comparative analysis to better contextualize our
model’s performance.</p>
      <p>We also observed sensitivity to prompt phrasing during generation, highlighting the need for
systematic prompt optimization strategies or prompt tuning techniques to reduce variability in output
quality. This issue was particularly evident in our zero-shot experiments.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was supported by the Brazilian National Council for Scientific and Technological
Development (CNPq) under Project 441610/2023-4, and by the Natural Sciences and Engineering Research
Council of Canada (NSERC) Discovery Grants program (RGPIN-2021-04130).</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT to assist with grammar and spelling
checks. After using this service, the author(s) reviewed and edited the content as needed and take(s)
full responsibility for the publication’s content.</p>
      <p>To support further research in clinical natural language processing, particularly within
Portuguese-language healthcare contexts, our MedGemma-Sum-Pt model is publicly available on Hugging Face at
https://huggingface.co/pucpr-br/medgemma-pt-finetuned-multiclinsum, released under a research-only
license.</p>
      <p>As it was built upon the original MedGemma model, its use is restricted to academic and
non-commercial purposes, in accordance with the original licensing constraints.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L. E. S. e.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <surname>C. M. C. M. Barra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <article-title>Assembling natural language processing resources to perform the summarization of clinical narratives</article-title>
          ,
          <source>Doctoral thesis</source>
          ,
          <source>Pontifícia Universidade Católica do Paraná, Curitiba</source>
          ,
          <year>2020</year>
          . URL: https://archivum.grupomarista.org.br/pergamumweb/vinculos/000094/0000943a.pdf, accessed: 21 Dec. 2020.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E. T. R.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. V. A.</given-names>
            <surname>de Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Knafou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E. S. e.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Copara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. B.</given-names>
            <surname>Gumiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. F. A. d.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Paraiso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Teodoro</surname>
          </string-name>
          ,
          <string-name>
            <surname>C. M. C. M. Barra</surname>
          </string-name>
          ,
          <article-title>BioBERTpt - a Portuguese neural language model for clinical named entity recognition</article-title>
          ,
          <source>in: Proceedings of the 3rd Clinical Natural Language Processing Workshop</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          . URL: https://www.aclweb.org/anthology/2020.clinicalnlp-1.7.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodríguez-Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lima-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Escolano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pratesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vigil-Gimenez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Farré-Maduell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <article-title>Overview of MultiClinSum task at BioASQ 2025: evaluation of clinical case summarization strategies for multiple languages: data, evaluation, resources and results</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <article-title>Overview of BioASQ 2025: The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in:
          <string-name>
            <surname>J. Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Gonzalo</surname>
          </string-name>
          , et al. (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>