<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>X (M. Rodríguez-Ortega);</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Overview of MultiClinSum Task at BioASQ 2025: Evaluation of Clinical Case Summarization Strategies for Multiple Languages: data, evaluation, resources and results.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miguel Rodríguez-Ortega</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduard Rodríguez-López</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Salvador Lima-López</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Escolano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Audrey Mash</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maite Melero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Pratesi</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Vigil-Gimenez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leticia Fernandez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eulàlia Farré-Maduell</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Krallinger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Barcelona Supercomputing Center</institution>
          ,
          <addr-line>Plaça Eusebi Güell, 1-3, 08034 Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hospital Universitari Mútua Terrassa</institution>
          ,
          <addr-line>Plaça del Doctor Robert, 5, 08221 Terrassa, Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Parc Tauli Hospital Universitari, Parc Taulí</institution>
          ,
          <addr-line>1, 08208 Sabadell, Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Translated</institution>
          ,
          <addr-line>Via Indonesia, 23, Rome, Latium 00144</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Recent developments in generative AI-based solutions, particularly large language models (LLMs), are enabling high-impact use cases not only in general-domain applications but also in biomedical and clinical domains. Healthcare professionals and researchers face significant challenges related to efficiency, patient safety, and the delivery of high-quality care, largely due to the ever-growing volume of lengthy clinical documents, including electronic health records and clinical case reports. Automatic clinical document summarization, especially across multiple languages beyond English, has the potential to enhance healthcare system efficiency, improve care management, and support clinical research involving unstructured health data. A key requirement for such systems is the ability to generate high-quality summaries that preserve essential clinical insights, such as patient characteristics, diagnoses, therapeutic indications, and outcomes, while minimizing biases. Achieving this requires a robust evaluation framework to benchmark the quality and accuracy of the summaries generated from unstructured health data. To address this need, we organized the MultiClinSum shared task as part of BioASQ/CLEF 2025. This task focused on evaluating automatic summarization of clinical case reports in four languages: English, Spanish, French, and Portuguese. A total of 10 teams submitted 50 runs for the MultiClinSum task, with the majority of top-performing systems adopting abstractive approaches built on decoder-only architectures such as Qwen and MedGemma. These models were often fine-tuned on biomedical corpora using parameter-efficient strategies like LoRA or quantization. A few extractive methods were also explored, but they generally achieved lower performance, highlighting the advantage of abstractive techniques for capturing nuanced clinical information.
Overall, the participating systems showed promising performance, achieving results of 0.870 BERTScore F1 for English, 0.758 for Spanish, 0.752 for French, and 0.747 for Portuguese. We have publicly released the MultiClinSum corpora to support the development and evaluation of new clinical summarization systems. The dataset includes both a multilingual gold standard of human-written summaries and a large-scale collection derived from the PMC-Patients collection. This resource also includes additional datasets (Romanian, Italian, Dutch, Swedish, Czech, Catalan, Norwegian, Danish, German, Russian and Greek), extending beyond the four languages covered in the shared task. The MultiClinSum resources are available at: https://zenodo.org/records/15546018</p>
      </abstract>
      <kwd-group>
        <kwd>text summarization</kwd>
        <kwd>Gen AI</kwd>
        <kwd>clinical case reports</kwd>
        <kwd>biomedical adaptation</kwd>
        <kwd>multilingual</kwd>
        <kwd>clinical NLP</kwd>
        <kwd>generative pre-trained transformers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The volume of information available to clinicians and biomedical researchers is growing rapidly,
both in terms of biomedical literature and in the form of clinical records [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Accurate patient care
requires that clinicians are able to efficiently retrieve, interpret, and integrate relevant information
from multiple data sources. Biomedical researchers need to navigate vast amounts of literature
to generate new hypotheses and stay current with research advancements in their fields. Electronic
resources—such as online literature databases and electronic health record (EHR) systems—have been
developed to support clinicians and researchers in managing this expanding body of information.
For doctors, the widespread adoption of EHRs has significantly increased the clinical documentation
workload, contributing directly to rising stress levels and clinician burnout.
      </p>
      <p>
        Clinicians currently devote substantial time to summarizing large volumes of textual
information—whether compiling diagnostic reports, writing progress notes, or synthesizing a patient’s
treatment history across multiple specialists [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Recent studies indicate that physicians may spend up to two hours on documentation for every hour of
direct patient care [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Even for experienced physicians with extensive expertise, this complex task
inherently carries the risk of errors, which can be particularly detrimental in a field where precision
is critical [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Failing to thoroughly review a patient’s detailed clinical history can lead to serious
medical errors [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These may include misdiagnosis, inappropriate treatment, and life-threatening
complications, often resulting from communication failures and a lack of coordination or sharing
of relevant information among care team members. Additionally, with the rapidly growing body of
clinical literature—particularly case reports—it is becoming increasingly challenging to identify and
interpret key medical insights that are relevant for evidence-based medicine and clinical decision support.
      </p>
      <p>
        The application of automatic text summarization methodologies offers a robust solution to these
challenges, through the extraction of concise and coherent representations of extensive clinical texts
[
        <xref ref-type="bibr" rid="ref1 ref10 ref11 ref6 ref7 ref8 ref9">6, 7, 1, 8, 9, 10, 11</xref>
        ]. Automatic summarization aims to reduce the length of a text while preserving
its essential information content. Summarization can be extractive, selecting key sentences from the
original text, or abstractive, generating novel sentences that convey the main ideas more concisely.
With the advent of Large Language Models (LLMs), especially those fine-tuned for biomedical or
clinical language (e.g., BioBERT[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], ClinicalBERT[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], BioGPT[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]), abstractive summarization has
become increasingly feasible and effective. The generation of such summaries has been demonstrated
to facilitate healthcare professionals in the rapid and effective comprehension and prioritization of
patients’ clinical histories. Beyond the clinical environment, text summarization has applications
in medical education, evidence synthesis, and biomedical research, where fast access to essential
information from case reports, EHRs and biomedical literature can accelerate learning and discovery.
      </p>
      <p>
        Current research has shown that LLMs can produce clinically coherent and semantically
accurate summaries of complex medical documents, often approaching or exceeding the quality of
human-written summaries. These models are capable of handling nuances in medical language and
capturing key clinical insights, such as diagnoses, symptoms, interventions, and outcomes. However,
most existing work has focused on English-language texts, and there is a growing recognition of
the need to support multilingual clinical summarization. This is particularly important for enabling
equitable access to clinical knowledge, supporting international research collaboration, and ensuring
that non-English clinical data can be efficiently used in both local and global contexts. In a study by
Dave Van Veen et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the authors demonstrated that domain-adapted LLMs can achieve strong
performance across various clinical summarization tasks—including radiology reports, patient-provider
dialogues, and progress notes—sometimes even outperforming human-written summaries in terms
of completeness and correctness, as assessed in a physician-led reader study. The study highlights
both the potential and limitations of LLMs: while these models can reduce documentation burden and
enhance clinical efficiency, their performance varies considerably depending on the task and adaptation
strategy. The authors also emphasized the critical need to align automatic evaluation metrics with
human judgment, a key consideration in clinical contexts where summary quality directly affects
patient care.
      </p>
      <p>
        Building on this line of research, the study presented in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] demonstrated the effectiveness of
hybrid methods that combine extractive and abstractive techniques for summarizing electronic health
records (EHRs), particularly in high-stakes settings such as intensive care units (ICUs). The authors
integrated concept-based extraction with a T5-based transformer model to generate daily progress
note summaries, which were subsequently used to predict ICU patient length of stay (LOS). By
combining these summaries with structured data, their support vector machine (SVM)-based approach
achieved a prediction accuracy of 77.5%, outperforming existing systems. This use case—clinical
summarization for both human interpretation and machine learning input—underscores the practical
value of high-quality summaries in supporting clinical decision-making.
      </p>
      <p>
        Specific clinical domains, such as radiology, have also seen growing interest in summarization
tasks. A recent study addressed key limitations in the field, including the reliance on private datasets
and a narrow focus on chest X-ray data [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. By introducing the RadSum23 shared task at BioNLP
2023—which utilized the MIMIC-III and MIMIC-CXR datasets along with a newly released Stanford
test set—researchers expanded the scope of radiology summarization to include multiple imaging
modalities and anatomical regions. The task attracted significant participation, with 112 submissions
from 11 teams, underscoring the community’s commitment to developing robust and generalizable
summarization systems. These efforts are essential for advancing clinical NLP research, promoting
transparency, and enabling reproducibility in model evaluation.
      </p>
      <p>
        Beyond clinical applications, summarization research has also progressed through shared evaluation
campaigns such as the MSLR2022 shared task, which focused on multi-document summarization for
literature reviews [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ]. Held as part of the Scholarly Document Processing Workshop at COLING
2022, the task challenged participants to generate coherent summaries from biomedical abstracts,
simulating the synthesis typically performed in systematic reviews. By providing datasets from
Cochrane Reviews and the MS² corpus, the shared task showcased both the potential and current
limitations of existing summarization methods. The ProbSum 2023 shared task focused on summarizing
patients’ active diagnoses and problems from electronic health record (EHR) progress notes [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
The TAC BiomedSumm track asked teams to utilize citation sentences—known as citances—that
reference a specific paper for summarization. The task involved three main components: identifying
the corresponding text spans in the referenced papers that reflect the citances, classifying those spans
into predefined paper facets, and ultimately generating a summary of the referenced papers based on
the collective discussion provided by the citances.
      </p>
      <p>Summarization systems remain challenging to develop, as summaries must be not only linguistically
fluent and coherent but also medically accurate and relevant. Moreover, most evaluation and
benchmarking efforts have focused on English content, leaving the quality and challenges of clinical text
summarization in other languages—and comparative performance across multiple languages—largely
understudied. This gap motivated the launch of the MultiClinSum shared task on clinical case report
summarization, as part of the BioASQ/CLEF 2025 initiative. The inclusion of multilingual clinical case
reports (in English, Spanish, French, and Portuguese) for the MultiClinSum task—using both native
and automatically translated clinical texts—represents an important step forward. It encourages the
development of summarization systems that are robust across languages and sensitive to the unique
requirements of the clinical domain.</p>
      <p>This paper presents the MultiClinSum shared task, a multilingual biomedical summarization challenge
focused on clinical case reports in English, Spanish, Portuguese, and French. The goal is to advance
automatic summarization methods capable of addressing the diversity and complexity of clinical
narratives across languages. The work is organized as follows: Section 2 outlines the task description
and evaluation system; Section 3 details the construction of the corpus and the resources provided
to participants; Section 4 describes the systems submitted, their underlying methodologies, and the
obtained results; Section 5 provides a comprehensive discussion of the outcomes, including
language-specific observations and clinical implications; and Section 6 concludes the paper with final remarks
and potential directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task description</title>
      <p>
        The MultiClinSum shared task was organized as part of the BioASQ 2025 workshop, held under
the CLEF conference—an initiative that supports the systematic evaluation of information access
systems, primarily through experimentation with shared tasks. MultiClinSum specifically addressed
the automatic generation of clinical case summaries in four languages: English, Spanish, French, and
Portuguese. The provided gold standard corpus was constructed from a curated selection of full-text
clinical case reports and their corresponding summaries, derived from open-access clinical case report
publications. Additionally, a large-scale dataset was released, built on the PMC-Patients collection [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ],
where patient summaries were used as full cases, while the corresponding summaries were extracted from specific
sections of the associated PubMed abstracts. In both datasets, each available full-text and summary
pair was translated into the target languages to support multilingual experimentation. For evaluation
purposes, the automatically generated summaries submitted by participating teams were compared
against human-written summaries provided by the original article authors (or their corresponding
translations), using ROUGE-Lsum scores and BERTScore as evaluation metrics.
      </p>
      <p>The MultiClinSum shared task was organized into three main phases: training, testing, and
evaluation/submission. During the training phase, participants were provided with multilingual
datasets consisting of clinical case report texts and their corresponding summaries in English, Spanish,
Portuguese, and French. These training sets were obtained from both the curated clinical case
report collection (gold standard) and the PMC-Patients dataset (large-scale dataset), following a fixed
training split comprising 60% of the data. In the test phase, participants received full-text clinical
case reports without accompanying summaries, and their systems were expected to automatically
generate coherent clinical summaries. Each sub-track focused on one of the four target languages,
allowing participants to develop either monolingual systems or multilingual systems capable of
processing several languages. The task did not require cross-lingual summarization but emphasized
language-specific summary generation.</p>
      <p>In the evaluation phase, participating teams submitted their system outputs through the BioASQ
online platform, with submissions provided in ZIP format. Each team was allowed to submit up to five
runs per language. The evaluation of system-generated summaries was carried out using two automatic
metrics: ROUGE-Lsum, which measures lexical overlap, and BERTScore, which assesses semantic similarity.
A fair and consistent benchmarking process across all participating languages was ensured throughout
the evaluation phase.</p>
      <sec id="sec-2-1">
        <title>2.1. Subtracks</title>
        <p>To promote the development of clinically relevant summarization models across diverse linguistic
contexts, the task was organized into separate sub-tracks by language rather than using a single
multilingual corpus. This approach allows a more equitable evaluation and enables targeted adaptation
to language-specific data and resources.</p>
        <p>MultiClinSum was structured into four individual sub-tracks. Each track focuses on the
adaptation of automated text summarization systems using clinical case report texts in a specific language:
MultiClinSum-en for English, MultiClinSum-es for Spanish, MultiClinSum-fr for French and
MultiClinSum-pt for Portuguese.</p>
        <p>Participants were free to take part in any of the sub-tracks, using either monolingual
or multilingual models. Notably, each sub-track had a differently sized dataset.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Evaluation</title>
        <p>
          The use of appropriate performance metrics to evaluate the scientific and technical robustness
and relevance of automatic clinical text summarization and generative AI models is essential for
an objective and transparent comparative assessment of the results generated by participating
teams. Evaluating automatically generated summaries requires careful consideration of the diverse
factors of summary quality, including fluency and factual consistency. Numerous metrics have been
proposed for this purpose, each with distinct strengths and limitations. Lexical overlap metrics
like ROUGE [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and METEOR [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] measure surface-level n-gram matches. Semantic-oriented
metrics such as BERTScore [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] leverage contextual embeddings to assess deeper textual
correspondence, while model-based metrics like BARTScore [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] and BLEURT [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] focus on fluency and coherence.
        </p>
        <p>
          To assess the quality of automatically generated clinical case summaries across all sub-tracks of the
MultiClinSum shared task, we employed two complementary evaluation metrics: ROUGE-Lsum [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]
and BERTScore [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. These metrics were selected to capture both surface-level lexical overlap and
deeper semantic correspondence between system-generated outputs and reference summaries.
By integrating both evaluation paradigms, we ensure that the assessment protocol reflects both
established standards in summarization and emerging best practices for clinical NLP. It also allows systems
to be evaluated from more than a single perspective.
        </p>
        <sec id="sec-2-2-1">
          <title>2.2.1. ROUGE-Lsum</title>
          <p>ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a standard suite of metrics widely used
in summarization research. ROUGE-Lsum is a sentence-level variant of the ROUGE-L metric, adapted
for multi-sentence summarization tasks. Rather than computing the longest common subsequence
(LCS) over entire documents, ROUGE-Lsum segments both the candidate and reference summaries by
sentence boundaries (typically newlines) and computes LCS scores independently for each sentence
pair. The final metric aggregates these scores to yield recall, precision, and F1 values, offering a more
granular and structurally sensitive assessment than traditional ROUGE-L.</p>
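      <p>To make the sentence-level computation concrete, the following is a minimal Python sketch (this is not the official evaluation script; real ROUGE-Lsum aggregates union-LCS statistics, whereas this simplification takes the best-matching candidate sentence for each reference sentence):</p>

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence length
    # between two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_lsum(candidate, reference):
    # Simplified ROUGE-Lsum sketch: split summaries into sentences at
    # newlines, score each reference sentence against its best-matching
    # candidate sentence, then aggregate into precision, recall and F1.
    cand_sents = [s.split() for s in candidate.splitlines() if s.strip()]
    ref_sents = [s.split() for s in reference.splitlines() if s.strip()]
    lcs_total = sum(max(lcs_len(r, c) for c in cand_sents) for r in ref_sents)
    recall = lcs_total / sum(len(r) for r in ref_sents)
    precision = lcs_total / sum(len(c) for c in cand_sents)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

      <p>For identical candidate and reference summaries all three scores are 1.0; paraphrased candidates are penalized because only exact token matches contribute to the LCS.</p>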
          <p>
            As an exact-match measure, its relevance to abstractive summarization tasks can be questioned.
However, clinical summarization brings unique needs, where factual consistency and coverage of
key medical entities (e.g., diagnoses, treatments) are critical. Past evaluations on clinical dialogue
summarization found that while ROUGE metrics (including ROUGE-Lsum) correlate poorly with
human judgments of coherence and clinical utility, they remain useful for benchmarking surface-level
content overlap, particularly in extractive settings [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ]. Meta-evaluations of faithfulness metrics in
hospital-course summarization further highlight that ROUGE-Lsum’s sentence-level granularity can
help identify localized factual matches, though it fails to capture semantic fidelity or logical flow—gaps
often addressed by combining its results with semantic metrics [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ].
          </p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. BERTScore</title>
          <p>BERTScore offers a semantic evaluation framework based on contextualized token embeddings derived
from pre-trained transformer models, such as BERT. Unlike ROUGE, which relies on n-gram matching,
BERTScore computes pairwise cosine similarity between all tokens in the candidate and reference texts.
Greedy matching is then performed to align semantically similar tokens, from which precision, recall,
and F1 scores are computed.</p>
          <p>BERTScore has demonstrated superior alignment with human judgment in multiple summarization
benchmarks, particularly when reference and candidate summaries use divergent lexical forms
to express equivalent content. In the clinical domain, this is especially valuable, as synonymous
terminology and paraphrasing are common (e.g., “myocardial infarction” vs. “heart attack”).</p>
          <p>
            To evaluate participant entries in MultiClinSum, Facebook’s RoBERTa-Large [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ]1 was used for
English and a Google multilingual BERT model [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ] 2 was used for the rest of the languages (Spanish,
French and Portuguese).
          </p>
          <p>As part of the task, an official MultiClinSum evaluation script is available on GitHub 3. After the task
results were released, the test set Gold Standard annotations were shared with participating teams to
enable them to perform additional experiments and facilitate error analysis of their systems.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Corpus and resources</title>
      <p>
        For the MultiClinSum task, publicly available clinical case reports were used [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]—a specialized medical
publication type of key clinical relevance, which often resembles the kind of medical information found
in discharge summaries. Specific guidelines and reporting recommendations for clinical case reports
have been established, stating that a case report should follow a format that includes demographic
information, medical history, presenting concerns, clinical findings, diagnoses, interventions, outcomes
(including adverse events), and follow-up for a given patient [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. Clinical case reports are a type of
medical scientific publication that describe an individual patient’s medical history, detailing symptoms,
clinical findings, treatments, diagnostic reasoning, and other relevant medical information [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. In some
cases, these reports focus on rare diseases, unexpected treatment responses, or emerging health threats,
playing a crucial role in the dissemination of medical knowledge. Such texts became particularly
important during the COVID-19 pandemic, serving as a valuable source for studying symptoms and
disease progression. Public literature repositories such as PubMed Central (PMC) are key resources for
clinical case report publications, providing clinicians and researchers with access to a large number of
case studies. The widely used PubMed database, for example, contains nearly 2.5 million citations
corresponding to clinical case reports dating from 1846 to the present.
      </p>
      <p>An illustrative example of a clinical case report, along with its corresponding summary for all available
languages, is presented in figures 1, 2, 3, and 4.</p>
      <p>1: https://huggingface.co/FacebookAI/roberta-large
2: https://huggingface.co/google-bert/bert-base-multilingual-cased
3: https://github.com/nlp4bia-bsc/MultiClinSumEval.git</p>
      <p>The length of clinical case reports can vary depending on the complexity and structure of the
case, or on journal publication instructions, but they typically range from 1,000 to 3,000 words. Based on
an internal analysis of the datasets provided in the task, we found a mean clinical case report length of
2,480 words.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset creation process</title>
        <p>Two distinct methodologies were employed in the data selection and construction of the task dataset:
one focused on creating a gold standard dataset of native clinical case report and summary pairs, and
another based on the PMC-Patients subset.</p>
        <p>For the gold standard dataset, full case-summary pairs were manually selected independently
for each language, resulting in 640 pairs in English, 277 in Spanish, 100 in Portuguese and 100 in
French. These clinical cases were primarily sourced from the PubMed database, using filters based on
publication type "case reports" and publication languages.</p>
        <p>
          The construction of the large-scale dataset started with the retrieval of the publicly available
PMC-Patients subset [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], consisting of clinical summaries of patients in English that have been
extracted from the PubMed Central (PMC) full-text article repository. To obtain the corresponding
author-generated summaries, the abstracts of these PMC articles were processed to isolate specific
text fragments that briefly describe the patient's clinical case. This extraction was done by manually
filtering abstract sections of interest (e.g., clinical case, presentation of the case, case presentation).
Following this process, the filtered abstract sections were mapped to their corresponding PMC-Patients
record, thereby generating a total of 43,750 full case-summary pairs. As a last step, the pairs were
pre-processed in three major steps: text pre-processing, de-duplicating records, and removing
noisy data. Text pre-processing was limited to deleting HTML artifacts from the documents, as well as
figure and table references mentioned within the case summaries. Duplicate entries—both within the
large-scale dataset and across the native gold standard—were removed based on PMC IDs, as well as
exact string matching of either cases or summaries, reducing the dataset to 40,650 samples. Additional
filters eliminated veterinary cases and case series using substring heuristics, further narrowing the
dataset to 40,063 samples.
        </p>
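        <p>The de-duplication logic described above can be sketched as follows; the record structure and field names (pmcid, case, summary) are illustrative assumptions, not the actual dataset schema:</p>

```python
def deduplicate(records):
    # Sketch of the de-duplication step: drop any record that shares a
    # PMC ID, a full case text, or a summary text with a record that was
    # already kept. `records` is assumed to be a list of dicts with
    # keys 'pmcid', 'case' and 'summary' (hypothetical names).
    seen_ids, seen_texts, kept = set(), set(), []
    for rec in records:
        if rec["pmcid"] in seen_ids:
            continue  # duplicate by PMC ID
        if rec["case"] in seen_texts or rec["summary"] in seen_texts:
            continue  # duplicate by exact string match
        seen_ids.add(rec["pmcid"])
        seen_texts.update((rec["case"], rec["summary"]))
        kept.append(rec)
    return kept
```

        <p>Because matching is exact, near-duplicates with minor textual differences survive this pass, which is consistent with the additional substring-heuristic filters applied afterwards.</p>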
        <p>While the large-scale dataset created from PMC abstracts and case descriptions provided substantial
coverage, its construction introduced variability in the quality and structure of the
author-generated summaries. In clinical writing, summarization practices can vary significantly from one
case to another: some authors include extensive detail, while others are more
selective and concise. This inconsistency can introduce noise in both the model training and evaluation
phases. To mitigate this variability, an outlier detection strategy based on content length distributions
was applied. Specifically, we identified and removed cases considered statistical outliers using a 3 × IQR
(interquartile range) threshold applied to three features: the relative length of the summary compared
to the full case text, the absolute word count of the summary, and the number of sentences. All three of
these distributions were long-tailed, so one of their boundaries was set qualitatively. Once this data
filtering phase was completed under the aforementioned conditions, the resulting large-scale dataset
comprised a total of 38,853 full case-summary pairs in English. Figure 5 shows the whole process of the
large-scale dataset creation.</p>
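The 3 × IQR outlier filter described above can be sketched as follows. This is an illustrative sketch only, not the organizers' code: the feature extraction (whitespace tokenization, punctuation-based sentence counting) and the function names are simplifying assumptions.

```python
# Sketch of the 3*IQR outlier filter over three summary features:
# relative length, absolute word count, and sentence count.

def iqr_bounds(values, k=3.0):
    """Return (lower, upper) bounds placed k*IQR beyond the quartiles."""
    xs = sorted(values)
    def quantile(q):
        pos = q * (len(xs) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(xs) - 1)
        frac = pos - lo
        return xs[lo] * (1 - frac) + xs[hi] * frac
    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def filter_outliers(pairs, k=3.0):
    """Keep (case, summary) pairs whose summary features fall within k*IQR.

    Features: relative length (summary words / case words), absolute
    summary word count, and a crude punctuation-based sentence count."""
    feats = [(len(s.split()) / max(len(c.split()), 1),
              len(s.split()),
              s.count('.') + s.count('!') + s.count('?'))
             for c, s in pairs]
    bounds = [iqr_bounds([f[i] for f in feats], k) for i in range(3)]
    return [p for p, f in zip(pairs, feats)
            if all(lo <= f[i] <= hi for i, (lo, hi) in enumerate(bounds))]
```

In practice each of the three features prunes a different kind of anomaly: over-long summaries, near-empty summaries, and summaries whose length is disproportionate to their source case.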
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Machine translation process and quality control of translations</title>
        <p>
          To generate a larger multilingual MultiClinSum dataset, machine translation techniques were used,
employing the SalamandraTA-7B-instruct model (https://huggingface.co/BSC-LT/salamandraTA-7b-instruct), a fine-tuned version of the Salamandra large
language model[
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] that was optimized for translation across 35 official European languages. All
examples were produced using the Transformers library (https://github.com/huggingface/transformers) with the beam search decoding algorithm
with 5 beams.
        </p>
        <p>
          Since the original dataset contains only monolingual examples, the quality of the generated
translations was assessed through quality estimation using COMET-KIWI[
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]. This metric, ranging from 0 to
1, computes sentence embeddings for both the input sentence and its translated counterpart, followed
by calculating the similarity between the resulting vectors. Higher similarity scores indicate greater
semantic alignment between the sentences, suggesting that the meaning of the original input has
been effectively preserved. The results show an average score of 0.81 across the four languages tested,
ranging from 0.75 for Portuguese–English translations to 0.82 for English–Spanish. Across all language
pairs, 91% of the documents achieved a score of 0.7 or higher, indicating robust performance across the
entire dataset.
        </p>
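The document-level audit described above, once QE scores have been computed, reduces to a simple threshold pass. A minimal sketch follows, assuming the per-document COMET-KIWI scores are already available (the actual scores come from the COMET toolkit; the function name and data layout here are illustrative assumptions).

```python
# Post-hoc audit of quality-estimation scores: split documents at a
# threshold (0.7 in the dataset audit) and report coverage and mean score.

def qe_audit(scored_docs, threshold=0.7):
    """scored_docs: list of (doc_id, qe_score) pairs with scores in [0, 1].
    Returns (kept, dropped, coverage, mean_score)."""
    kept = [(d, s) for d, s in scored_docs if s >= threshold]
    dropped = [(d, s) for d, s in scored_docs if s < threshold]
    n = len(scored_docs)
    coverage = len(kept) / n if n else 0.0
    mean_score = sum(s for _, s in scored_docs) / n if n else 0.0
    return kept, dropped, coverage, mean_score
```

With the reported statistics (average 0.81, 91% of documents at or above 0.7), such an audit flags only the low-scoring tail for inspection rather than discarding it automatically.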
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Corpus statistics</title>
        <p>MultiClinSum corpus text statistics were calculated for each language for both gold standard and
large-scale created datasets. These statistics include the total number of sentences and total tokens, as
well as their average value across the corresponding dataset. Tables 1 and 2 illustrate the computed
statistics of full cases and summaries respectively. The English sub-track reveals the lowest number of
sentences per document in comparison to the other languages, both in relation to the gold standard and
the large-scale dataset.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. MultiClinSum additional resources</title>
        <p>In order to carry out additional analyses beyond the MultiClinSum shared task settings and the four
main target languages used, we also generated additional alternative translations. These serve to further
characterize the impact of different machine translation systems, as well as to provide a scenario for data
augmentation. To ensure high-quality medical text translations for the MultiClinSum task additional
dataset, the shared task organizers selected Translated’s Lara system, in addition to leveraging the
previously described in-house translation tools developed at BSC. Translated generously provided
these translations free of charge in support of the task. Lara had previously been employed in clinical
NLP applications within the DataTools4Heart project, where it demonstrated strong performance in
complex clinical translation scenarios and was validated by professional medical translators. Thus,
additional alternative translations were provided for clinical cases in English, Spanish, French and
Portuguese (MultiClinSum data augmentation collection). Moreover, extended MultiClinSum corpora
were generated for the following languages: Romanian, Italian, Dutch, Swedish, Czech, Catalan,
Norwegian, Danish, German, Russian and Greek. Some of the author-provided clinical case summaries
used for the MultiClinSum task are, from a clinical perspective, more complete, comprehensive, and
focused specifically on patient information, rather than on general motivation, relevance, or background
of the discussed diseases. To assess the clinical quality and completeness of these case report summaries,
clinical experts were asked to classify them using specific quality labels (very complete, complete, and
partially complete), as well as labels reflecting the particular clinical information covered (e.g., patient
demographics, clinical presentation, diagnosis, intervention and treatment, outcome, and follow-up).
This additional set of clinical label classifications will be released alongside the test set clinical case
reports to support further research and analysis of clinical summarization approaches.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Participation overview</title>
        <p>In general, there has been a very satisfactory participation in the task with promising results in each of
the sub-tracks. A total of 60 teams registered for the MultiClinSum task, of which 10 teams submitted
at least one run for a given sub-track, as presented in Table 3. Specifically, 8 teams participated in the
English sub-track, 6 teams in the Spanish, 5 teams in French, and 6 in Portuguese. Each team was
allowed to submit up to 5 runs per sub-track. As expected, the best results were obtained in the English
sub-track (MultiClinSum-en), which had the highest level of participation. Nevertheless, the other
sub-tracks were quite well represented in terms of both participation and novel methodologies applied.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. System results</title>
        <p>The results of MultiClinSum for each sub-track are shown in Tables 4, 5, 6 and 7. Across the four
languages, a few teams’ systems consistently demonstrated strong performance in both lexical and
semantic evaluation metrics. The most consistent team was pjmathematician [35], which achieved top
or near-top BERTScore F1 values across all languages and ranked first in ROUGE-F1 in the French and
Portuguese sub-tracks. In the English sub-track, team seemdog led overall with the highest BERT-F1
(0.870), showcasing particularly effective semantic understanding in this language. Meanwhile, in
the Spanish track, team grazhdanski [39] outperformed the rest with the best BERTScore and strong
ROUGE-F1.</p>
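The lexical side of the rankings rests on the ROUGE-L family, whose core is the longest common subsequence between candidate and reference. The sketch below is a minimal illustration of the ROUGE-L F1 idea, not the official scorer (the task used standard ROUGE-Lsum tooling; tokenization here is plain whitespace splitting).

```python
# Minimal ROUGE-L F1: precision and recall over the longest common
# subsequence (LCS) of candidate and reference token sequences.

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    """P = LCS/|candidate|, R = LCS/|reference|, F1 = harmonic mean."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_len(c, r)
    p, rec = lcs / len(c), lcs / len(r)
    return 0.0 if p + rec == 0 else 2 * p * rec / (p + rec)
```

Because the score rewards in-order lexical overlap, a fluent paraphrase can score low on ROUGE-L while scoring high on an embedding-based metric such as BERTScore, which is exactly the trade-off observed across submissions.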
        <p>Other teams also showed competitive results in specific areas. For instance, MaLei [37] and MedCOD
[42] ranked highly in ROUGE precision scores, particularly in English, indicating strong surface-level
alignment with the test set summaries. Team BU Team [36] performed consistently across all sub-tracks,
with particularly strong BERT recall in French and Portuguese. Some teams, such as ExtraSum [38],
achieved high lexical precision (ROUGE-P) but fared less well in terms of semantic similarity, highlighting
the difficulty of achieving good results in both. Overall, the results show the complexity of clinical
summarization across languages and suggest that systems prioritizing semantic representation, such as
those performing better in terms of BERTScore, tend to generalize more effectively across linguistic
contexts.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Methodologies by team</title>
        <p>This section provides an overview of the methodologies employed by participating teams in the
MultiClinSum shared task. Despite the diversity of approaches, most systems were built upon large
language models, with varying strategies in pre-processing, fine-tuning, and generation. Participants
explored both extractive and abstractive paradigms, leveraging encoder-decoder and decoder-only
architectures, as well as monolingual and multilingual setups. Below, we briefly summarize the main
methodological choices made by each team.</p>
        <p>Team ExtraSum</p>
        <p>This team compared four existing extractive summarization approaches: graph-based, concept-based,
topic-based and cluster-based techniques. The approach emphasizes factual consistency by preserving
original sentences and highlights language-specific performance trends (e.g., tokenizer artifacts in
Spanish). Clustering-based selection outperforms other extractive approaches in both ROUGE and
BERTScore, yet still falls short when compared against abstractive approaches.</p>
        <p>Team ÉTS-PUCPR</p>
        <p>This team presents MedGemma-Sum-Pt, a lightweight model for automatic summarization of
Portuguese clinical case reports. Addressing the challenges posed by limited annotated resources
and the linguistic complexity of medical Portuguese, the authors explore two strategies: (i) zero-shot
prompting using instruction-tuned multilingual language models, and (ii) supervised fine-tuning of
the domain-specific MedGemma model using LoRA, a parameter-efficient adaptation method. The
fine-tuned model demonstrates superior performance in internal and official evaluations, particularly
in terms of semantic fidelity as measured by BERTScore. Despite being trained on a relatively small
expert-annotated dataset and deployed using modest computational resources, MedGemma-Sum-Pt
outperforms all tested zero-shot baselines. The authors release the model publicly to support further
research in clinical NLP for low-resource languages, highlighting the viability of domain-adapted,
compact models for real-world medical applications.</p>
        <p>Team MedCOD</p>
        <p>The authors of this team proposed MedCOD, a contextual augmentation framework to enhance
multilingual medical summarization. The approach begins by extracting medical keywords from
full clinical texts using the Qwen2.5-14B model. These keywords are then translated into five target
languages (EN, ES, FR, DE, PT) using the NLLB-3.3B model, and validated through back-translation and
semantic equivalence checking. The validated multilingual keywords are incorporated into structured
prompts to provide contextual input for LLMs. Experiments were conducted using Qwen2.5-14B
and Phi-4B in both zero-shot and fine-tuned settings, with fine-tuning performed using LoRA
(Low-Rank Adaptation). The study shows that combining MedCOD prompting with fine-tuning
leads to improved summarization quality, particularly in non-English languages. Notably, even without
fine-tuning, MedCOD prompts alone provided substantial gains in languages like Portuguese and
French, highlighting its effectiveness as a lightweight adaptation method.</p>
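The keyword-injection step at the heart of this kind of contextual augmentation can be sketched as prompt assembly. The wording, field layout, and function name below are illustrative assumptions, not the team's actual prompt.

```python
# Hypothetical sketch of keyword-augmented prompting in the spirit of
# MedCOD: validated multilingual keywords are injected into a structured
# summarization prompt as a domain anchor.

def build_prompt(case_text, keywords, target_lang):
    """Compose a summarization prompt anchored with validated keywords."""
    kw_block = ", ".join(keywords)
    return (
        f"You are a clinical summarization assistant. "
        f"Summarize the following case report in {target_lang}. "
        f"Key medical concepts to preserve: {kw_block}.\n\n"
        f"Case report:\n{case_text}\n\nSummary:"
    )
```

The keyword block serves the dual role the authors describe: it signals the target domain vocabulary and, when the keywords are given in the target language, anchors the output language as well.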
        <p>Team grazhdanski</p>
        <p>In this case, the author explores the use of Group Relative Policy Optimization (GRPO), a
reinforcement learning approach, for summarizing Spanish clinical case reports within the MultiClinSum
shared task at CLEF 2025. Several adaptation strategies are evaluated using LLaMA 3.1 8B-Instruct,
including supervised fine-tuning, domain adaptation, and GRPO-based training. The GRPO-trained
model, optimized with ROUGE-L and BERTScore reward functions, outperforms both fine-tuned and
zero-shot baselines, achieving the best performance on the official test set. The study highlights the
potential of GRPO for improving clinical summarization in low-resource settings.</p>
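The group-relative step that gives GRPO its name normalizes each sampled summary's reward against the other samples for the same case. A minimal sketch of that computation follows; the reward values are placeholders for the ROUGE-L/BERTScore-based rewards described above, and this is not the team's training code.

```python
# Group-relative advantage as used in GRPO-style training: for a group of
# sampled completions of the same prompt, A_i = (r_i - mean) / (std + eps).

def group_advantages(rewards, eps=1e-8):
    """Normalize a group of scalar rewards to zero-mean, unit-scale advantages."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because advantages are relative within a group, the policy is pushed toward the better summaries among its own samples without needing a separate learned value model.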
        <p>Team JohannaUE</p>
        <p>This team presents Agentic MCS, a multilingual clinical summarization framework combining
extractive and abstractive techniques using a modular LangGraph-based architecture with components
such as NER-guided entity preservation, knowledge graphs, and fine-tuned or prompted language
models. Evaluated across four languages, it achieves strong semantic performance (BERTScore) and
highlights the trade-off with lexical overlap (ROUGE). The study demonstrates the effectiveness of
hybrid, agent-based pipelines for domain-specific biomedical summarization.</p>
        <p>Team BU Team</p>
        <p>The authors of this team propose a two-stage distillation framework for multilingual clinical
summarization using the Qwen2.5 model family [43]. In the first stage, a teacher model (Qwen2.5-72B-Instruct,
https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) extracts key clinical information; in the second
stage, the student model (Qwen2.5-0.5B-Instruct, https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) is
trained with a dual loss that supervises both summarization and alignment to the extracted information.
The purpose of this secondary task is not inference but rather to force the model to represent the
relevant clinical concepts and therefore have a greater chance of including them in the summary.</p>
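A dual objective of this shape is typically a weighted sum of two per-token losses. The sketch below illustrates the idea with toy probability vectors; the lambda weighting, the stand-in cross-entropy, and the function names are illustrative assumptions, since the team's exact loss formulation is not reproduced here.

```python
# Sketch of a dual training objective: a weighted sum of a summarization
# loss and an alignment loss against teacher-extracted key information.
import math

def cross_entropy(probs, target_idx):
    """Negative log-likelihood of the target class."""
    return -math.log(max(probs[target_idx], 1e-12))

def dual_loss(sum_probs, sum_targets, align_probs, align_targets, lam=0.7):
    """L = lam * L_summary + (1 - lam) * L_alignment, averaged per token."""
    l_sum = sum(cross_entropy(p, t)
                for p, t in zip(sum_probs, sum_targets)) / len(sum_targets)
    l_align = sum(cross_entropy(p, t)
                  for p, t in zip(align_probs, align_targets)) / len(align_targets)
    return lam * l_sum + (1 - lam) * l_align
```

The alignment term is discarded at inference time; its only job during training is to force the student's representations to carry the teacher-identified clinical concepts.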
        <p>Team MaLei</p>
        <p>MaLei team implemented a prompt-based framework using Iterative Self-Prompting (ISP) to guide
large language models, specifically GPT-4 and GPT-4o. Their approach involved crafting a meta-prompt
that combined Chain-of-Thought instructions, clinical perspectives, and evaluation-based feedback,
which was refined iteratively using few-shot examples and model-generated reflections. They used
BERTScore and ROUGE-L to monitor performance across refinement epochs, optimizing prompt
versions until improvements plateaued. The final prompt version was applied to generate summaries
on the English test set, with additional regeneration steps to enforce conciseness where needed.</p>
        <p>Team pjmathematician</p>
        <p>This team employed a multilingual summarization approach based on fine-tuned Qwen family
language models, enhanced through Low-Rank Adaptation (LoRA). They developed an automated
prompt optimization framework in which a "judge" model iteratively refined a system prompt for a
"worker" model, aiming to maximize ROUGE scores. The final optimized prompt emphasized extractive
fidelity, closely mirroring the terminology and structure of the source clinical reports.</p>
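LoRA, used by this team and several others, freezes the pretrained weight matrix and learns only a low-rank additive update. A toy sketch of the effective weight follows (pure Python, tiny dimensions; the scaling convention alpha/r matches the usual LoRA formulation, but the helper names are assumptions).

```python
# Minimal sketch of a LoRA update: the frozen weight W is augmented by a
# low-rank product B @ A, scaled by alpha / r.

def matmul(X, Y):
    """Plain nested-list matrix multiplication."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_weight(W, A, B, alpha=16, r=2):
    """Return W + (alpha / r) * (B @ A), leaving the frozen W untouched."""
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]
```

Only A and B are trained, so the number of trainable parameters scales with the rank r rather than with the full weight matrix, which is what makes the method attractive for adapting large models on modest hardware.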
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Methodologies by approach</title>
        <p>This subsection presents an in-depth analysis of the system performance in the shared task, examining
how different methodological choices impacted summarization quality across languages. Given the
diversity of participating teams’ approaches, we analyze results along several key aspects: the type of
summarization strategy, model size and architecture, prompting techniques, and the influence of the
automatic translation systems. Table 8 provides a summary of the participant systems based on these
characteristics.</p>
        <p>Performance by summarization type</p>
        <p>The shared task attracted a wide range of approaches, including extractive (e.g., ExtraSum [38])
and abstractive (e.g., MaLei [37], MedCOD [42]) ones. Team pjmathematician [35] used a decoder-based model,
which implies abstractive summarization, yet the automatically optimized prompt stated the summaries
should be extractive. One of the systems proposed by team JohannaUE [40] had a similar idea to the
previous one, but implemented it in a more defined manner (as opposed to stating it in a prompt): a
first step ranked the most relevant sentences using BM25, with subsequent LLM processing for quality
assurance and abstractive condensation.</p>
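The BM25 ranking step of such an extract-then-abstract pipeline can be sketched as follows. This is a simplified illustration, not the team's implementation: here each sentence is scored as a "document" against a query of salient terms, and the query construction is an assumption.

```python
# Simplified BM25 ranking of sentences against a set of query terms,
# as in an extract-then-abstract pipeline's first stage.
import math

def bm25_rank(sentences, query_terms, k1=1.5, b=0.75):
    """Return sentence indices sorted by descending BM25 score."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    scores = []
    for d in docs:
        score = 0.0
        for q in query_terms:
            tf = d.count(q)                      # term frequency in sentence
            df = sum(1 for doc in docs if q in doc)  # sentence frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
            denom = tf + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf * (k1 + 1) / denom if denom else 0.0
        scores.append(score)
    return sorted(range(n), key=lambda i: -scores[i])
```

The top-ranked sentences would then be handed to an LLM for quality assurance and abstractive condensation, as the pipeline above describes.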
        <p>Abstractive systems generally outperformed extractive ones in BERT and ROUGE scores across all
languages. The extractive system ExtraSum was consistently outperformed by LLM-based abstractive
systems, despite showing competitive ROUGE precision in some languages (e.g., French, Portuguese).</p>
        <p>Extractive methods were also at a disadvantage given that the gold standard summaries had not
been formulated with extractive constraints in mind, resulting in a mismatch between the content
selected by extractive systems and the more paraphrastic or restructured gold summaries. This inherent
limitation reduced their ability to match the semantic richness captured by abstractive models,
particularly those leveraging LLMs capable of cross-lingual generalization and deeper contextual understanding.</p>
        <p>Performance by model size</p>
        <p>Larger models (e.g., pjmathematician’s 32B Qwen3 [35]) generally achieved the highest ROUGE-F1
and BERT-F1 scores across most tracks. However, smaller models with tailored fine-tuning (e.g., BU
team’s 0.5B Qwen2.5 with distillation and quantization [36]) performed competitively, indicating the
effectiveness of lightweight architectures when combined with targeted training strategies. In particular,
the two-stage distillation framework enabled the smaller model to internalize domain-relevant concepts,
improving content selection and summary coherence without increasing inference-time complexity.</p>
        <p>Although performance did not scale linearly with model size, larger models generally produced
better results even with less sophisticated approaches. However, the performance gains were not
substantial enough to render smaller models obsolete, highlighting their potential for deployment in
low-resource or high-latency environments.</p>
        <p>Performance by model architecture</p>
        <p>A notable trend in recent summarization research is the increasing dominance of decoder-only
architectures over the traditional encoder-decoder models like BART and T5. While encoder-decoder
frameworks were initially considered the gold standard for sequence-to-sequence tasks due to their
explicit separation of input comprehension and output generation, the landscape has shifted as
large-scale pretraining and instruction tuning have unlocked the generative power of decoder-only models.</p>
        <p>This shift is also reflected in the implementations presented: of the eight systems, one is encoder-only,
while the rest are all decoder-only.</p>
        <p>Prompt engineering strategies</p>
        <p>The increasing dominance of decoder-only architectures in the task of summarization has placed
prompt engineering at the forefront of model control and output quality. Unlike encoder-decoder
frameworks, which typically rely on supervised fine-tuning, decoder-only models such as GPT, Qwen,
or LLaMA depend heavily on prompt design to steer generation. As a result, the quality of prompts
emerged as a key differentiator among submissions, with participants adopting diverse strategies to
best guide the model towards the expected results.</p>
        <p>One notable approach was automated prompt optimization, exemplified by pjmathematician [35],
who leveraged a fully automated pipeline based on Qwen3 (32B) to generate prompts without manual
intervention. This allowed for scalable multilingual summarization and removed potential biases. MaLei
[37] applied a related strategy, employing Iterative Self-Prompting (ISP) with GPT-4, where prompts
were refined across multiple rounds to enhance semantic alignment and factuality, particularly for
English clinical summaries. Both papers include the final, optimized system prompts.</p>
        <p>Contextual prompt augmentation was another effective method. Team MedCOD [42] automatically
extracted medical keywords and injected them into the prompts to enrich domain specificity across five
languages (they added German). This structured augmentation yielded significant improvements in
non-English texts when compared to simpler summarization prompts, increasing F1 scores for both
BERTScore and ROUGE by around 0.15. The authors attribute this increase to the keywords acting as
both a language anchor and a domain signal.</p>
        <p>A more modular and dynamic strategy was adopted by JohannaUE [40], who introduced hybrid
prompt structures. Their multi-agent pipeline integrated Named Entity Recognition (NER) outputs and
knowledge graph summaries into the prompt, aligning extracted factual content with the abstractive
generation phase. This design allowed for prompt composition to adapt based on both language and
content type.</p>
        <p>Both ÉTS-PUCPR [41] and grazhdanski [39] tried to force their models to focus on relevant clinical
information. They both assigned a role (clinical assistant), goal (summarization) and structural
constraints (tone and format), with the former also adding some examples, making it few-shot as opposed
to zero-shot.</p>
        <p>Interestingly, some teams—such as BU Team and ExtraSum—chose not to use prompt engineering
and instead relied on lightweight fine-tuning of models. Their results, while competitive in certain
languages, highlighted the trade-offs between static model tuning and dynamic prompt-based control.</p>
        <p>Overall, the submissions demonstrated that prompt engineering can not only complement fine-tuning
but in some cases substitute for it.</p>
        <p>Effect of automatic machine translation</p>
        <p>The use of automatic machine translation to generate data in Spanish, French and Portuguese
from the English PMC-Patients subset of clinical cases is likely to impact system performance across
the translated languages. While translation ensured consistent content, it may have introduced
issues such as unnatural phrasing, tokenization inconsistencies, or loss of medical detail, particularly
notable in languages with less comprehensive medical translation resources. These difficulties could
have impacted both model training and evaluation metrics, especially ROUGE, which is sensitive to
lexical overlap. However, the solid performance observed in some teams’ submissions (e.g., ÉTS-PUCPR,
grazhdanski, MedCOD) suggests that with appropriate adaptation methods, translated data can still
support competitive summarization performance in low-resource languages.</p>
        <p>Table 8: Summary of the participant methodologies.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>The MultiClinSum shared task has led to the development and evaluation of novel systems for
automatic clinical case summary generation in four languages: English,
Spanish, French, and Portuguese. Together, these languages represent over 1,198 million native speakers
worldwide, complementing most previous medical summarization shared tasks and efforts, which have
primarily focused only on English content. Since many medical records and a significant number of
clinical case reports are not written in English, it is becoming increasingly important to promote the
development of NLP and generative AI systems that can handle non-English data or are inherently
capable of multilingual processing.</p>
      <p>Overall, the MultiClinSum task received a wide range of solutions and strategies explored by the
participating teams with a total of ten team submissions. The evaluation was conducted using both
lexical (ROUGE-Lsum) and semantic (BERTScore) metrics, reflecting the complex requirements of
clinical summarization. Across all languages, the best-performing systems demonstrated strong
semantic alignment with gold standard summaries, as captured by BERTScore-F1, while some systems
exhibited a trade-off with surface-level overlap (ROUGE scores).</p>
      <p>In the English sub-track, pjmathematician [35] achieved the best BERTScore-F1 score, while seemdog
led in ROUGE metrics. The BU Team [36] also performed strongly with a Qwen2.5-based distillation
framework, consistently ranking among the top. In Spanish, team grazhdanski [39] obtained the highest
overall scores using GRPO, a reinforcement learning method optimized with ROUGE-L and BERTScore
rewards. For French, the BU Team [36] again led, confirming the robustness of their multilingual
training strategy. For Portuguese, Team ÉTS-PUCPR [41] outperformed others in semantic metrics
using a fine-tuned, domain-specific model (MedGemma) adapted with LoRA, while Salim’s MedCOD
[42] framework also showed competitive results through keyword-based prompting.</p>
      <p>Top-performing systems combined efficient adaptation techniques (e.g., LoRA, prompting,
distillation) with strong domain alignment, illustrating the importance of lightweight, semantically
focused models in clinical summarization across the languages. Systems’ performance varied
noticeably across the language sub-tracks, reflecting both resource availability and linguistic complexity.
English, benefiting from abundant training data and model support, showed overall higher scores
for both BERTScore and ROUGE metrics. The Spanish and French sub-tracks demonstrated decent
performance, with certain systems like GRPO (Spanish) and MedCOD (French) achieving strong results
despite limited resources. Portuguese showed the largest performance gap, though domain-specific
fine-tuning (as in Team ÉTS-PUCPR’s MedGemma-Sum-Pt [41]) closed the gap significantly in semantic
metrics. These differences highlight the importance of language-specific strategies, as well as the
advantages of domain adaptation and efficient fine-tuning, particularly in under-resourced clinical settings.</p>
      <p>From a clinical perspective, automatically generated summaries must contain key medical
information essential for decision-making and case understanding. This includes the patient’s primary
diagnosis, relevant clinical observations, treatments, outcomes, and follow-up recommendations. In
the context of case reports, an accurate summarisation of these elements is essential to ensure that
clinicians can rapidly assess the core narrative without missing critical details. Systems that focus on
these aspects, through the approaches detailed in previous sections, demonstrate remarkable potential for
real-world integration into clinical documentation and decision-support workflows.</p>
      <p>Clinical case reports can vary widely, covering heterogeneous scenarios from rare genetic disorders
to complex surgeries and emerging diseases. This particular issue presents a challenge to automatic
summarisation systems, requiring them to be capable of handling diverse contents, terminologies,
and levels of detail. The capture of key clinical insights without the loss of unusual details requires
adaptable models and robust domain understanding, especially in highly specialized or uncommon cases.</p>
      <p>With regard to the author-provided summaries, while these are valuable for creating gold standard
datasets, they often vary in quality and consistency. The presence of a diversity of writing styles,
emphasis, and completeness may result in redundancy in the summaries, the introduction of subjective
interpretations, or an emphasis on narrative flow rather than structured clinical content, which may
have an impact on the training and evaluation processes. These inconsistencies highlight the need for
careful curation and potentially supplementary annotation to ensure reliable benchmarks for automatic
summarization models.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and outlook</title>
      <p>The MultiClinSum shared task addressed the growing need for effective automatic summarization of
clinical case reports across multiple languages. By providing a multilingual dataset covering English,
Spanish, French, and Portuguese languages, the task enabled the community to explore and evaluate a
wide range of summarization strategies using both gold standard and large-scale datasets. Participating
teams employed diverse methodologies, from prompt optimization with powerful LLMs to fine-tuning
multilingual architectures. The participant system results showed strong performance overall, with
English models achieving the highest BERTScore F1, and competitive results observed for the other
languages. By releasing the MultiClinSum resources to the public, we aim to promote further research
in this field, fostering innovation in multilingual clinical summarization NLP techniques and supporting
the development of tools that can improve healthcare and medical research on a global scale.</p>
      <p>As with discharge summaries, clinical case reports are particularly enriched with specific types
of clinical entities or concepts relevant to patient demographics (e.g., age, gender, or occupation),
clinical presentation (e.g., findings, signs, and symptoms), diagnosis (e.g., disorders and diseases),
intervention (e.g., clinical procedures and medications), as well as outcome and follow-up. Therefore,
further examination of the impact and relevance of these clinical entities for automatic summarization
approaches should be considered in future research scenarios and summarization system developments.
The release of semantically enriched or entity-annotated full clinical case reports and their corresponding
summaries might constitute a valuable resource to foster such future developments and is planned
for upcoming MultiClinSum data releases. A more granular evaluation setting, or specific subtasks
based on the type of case report or clinical specialty, could provide deeper insights into clinical case
summarization performance. Other aspects that would require further analysis relate to the robustness
and potential biases of automatic clinical text summarization strategies with respect to patient sex or
gender.</p>
      <p>We foresee that the MultiClinSum task may help promote future initiatives and efforts to implement
and evaluate clinical text summarization solutions across multiple languages—not only in English,
Spanish, French, and Portuguese. There is a pressing need to systematically collect and release
multilingual full-text clinical documents along with their corresponding summaries, in order to enable more
effective benchmarking and exploitation of generative AI and large language models (LLMs) for text
summarization tasks.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The MultiClinSum track was funded by Spanish and European projects such as DataTools4Heart (Grant
Agreement No. 101057849), AI4HF (Grant Agreement No. 101080430). This publication is part of the
R&amp;D&amp;I project TED2021-129974B-C22, funded by MICIU/AEI/10.13039/501100011033 and by the
European Union NextGenerationEU/PRTR (BARITONE, Proyectos de Transición Ecológica y Transición
Digital 2021). We would also like to acknowledge the scientific committee members, Sophia Ananiadou,
Horacio Saggion and Simon Mille, for their valuable feedback and suggestions regarding the task settings
and evaluation scenarios, as well as the BioASQ organizers, especially Anastasios Nentidis, for their
technical support.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used OpenAI GPT-4.1 to enhance the grammar
and paraphrasing. This was followed by a review and edit of the content, with the author(s) taking full
responsibility for the publication.</p>
      <p>(Eds.), Proceedings of the Seventh Conference on Machine Translation (WMT), Association for
Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 634–645. URL:
https://aclanthology.org/2022.wmt-1.60/.
[35] P. Vachharajani, pjmathematician at multiclinsum 2025: A novel automated prompt optimization
framework for multilingual clinical summarization, 2025.
[36] J. C. Nicolay Rusnachenko, Xiaoxiao Liu, J. J. Zhang, Using decoder-based distillation for enhancing
multilingual clinical case report summarization, 2025.
[37] Y. M. N. Libo Ren, L. Han, Malei at multiclinsum: Summarisation of clinical documents using
perspective-aware iterative self-prompting with LLMs, 2025.
[38] S. C.-D. Soukaina Rhazzafe, N. S. Nikolov, Multiclinsum: Extractive summarization of English,
Spanish, French and Portuguese clinical case reports, 2025.
[39] G. Grazhdanski, Group relative policy optimization for Spanish clinical case report summarization,
2025.
[40] J. Angulo, V. Y. Agentic, Agentic mcs: A multilingual clinical summarization framework, 2025.
[41] E. C. P. A. S. B. J. Elisa Terumi Rubel Schneider, Fernando Henrique Schneider, R. M. O. Cruz,
Medgemma-sum-pt: A lightweight model for Portuguese clinical summarization, 2025.
[42] A. A. R. S. K. Z. Y. Md Shahidul Salim, Lianne Fu, H. Yu, Enhancing multilingual medical
summarization via contextual keyword augmentation, 2025.
[43] Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei,
H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang,
L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan,
Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, Z. Qiu, Qwen2.5 technical report, 2025. URL:
https://arxiv.org/abs/2412.15115. arXiv:2412.15115.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fiszman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Weir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jonnalagadda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mostafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Del Fiol</surname>
          </string-name>
          ,
          <article-title>Text summarization in the biomedical domain: a systematic review of recent research</article-title>
          ,
          <source>Journal of biomedical informatics 52</source>
          (
          <year>2014</year>
          )
          <fpage>457</fpage>
          -
          <lpage>467</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Golob</surname>
          </string-name>
          Jr,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Como</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Claridge</surname>
          </string-name>
          ,
          <article-title>The painful truth: The documentation burden of a trauma surgeon</article-title>
          ,
          <source>Journal of Trauma and Acute Care Surgery</source>
          <volume>80</volume>
          (
          <year>2016</year>
          )
          <fpage>742</fpage>
          -
          <lpage>747</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sinsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Colligan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Prgomet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Westbrook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tutty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Blike</surname>
          </string-name>
          ,
          <article-title>Allocation of physician time in ambulatory practice: a time and motion study in 4 specialties</article-title>
          ,
          <source>Annals of internal medicine 165</source>
          (
          <year>2016</year>
          )
          <fpage>753</fpage>
          -
          <lpage>760</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Yackel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Embi</surname>
          </string-name>
          ,
          <article-title>Unintended errors with ehr-based result management: a case series</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          <volume>17</volume>
          (
          <year>2010</year>
          )
          <fpage>104</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>FitzGerald</surname>
          </string-name>
          ,
          <article-title>Medication errors: the importance of an accurate drug history</article-title>
          ,
          <source>British journal of clinical pharmacology 67</source>
          (
          <year>2009</year>
          )
          <fpage>671</fpage>
          -
          <lpage>675</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Van Veen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Van Uden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Blankemeier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Delbrouck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bluethgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pareek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polacin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Reis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Seehofnerova</surname>
          </string-name>
          , et al.,
          <article-title>Clinical text summarization: adapting large language models can outperform human experts</article-title>
          ,
          <source>Research square</source>
          (
          <year>2023</year>
          ) rs-3.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Van Veen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Van Uden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Blankemeier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Delbrouck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bluethgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pareek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polacin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Reis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Seehofnerová</surname>
          </string-name>
          , et al.,
          <article-title>Adapted large language models can outperform medical experts in clinical text summarization</article-title>
          ,
          <source>Nature medicine 30</source>
          (
          <year>2024</year>
          )
          <fpage>1134</fpage>
          -
          <lpage>1142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bednarczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Reichenpfader</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gaudet-Blavignac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Ette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bensahla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bjelogrlic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lovis</surname>
          </string-name>
          ,
          <article-title>Scientific evidence for clinical text summarization using large language models: Scoping review</article-title>
          ,
          <source>Journal of Medical Internet Research</source>
          <volume>27</volume>
          (
          <year>2025</year>
          )
          <article-title>e68998</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chaves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kesiku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Garcia-Zapirain</surname>
          </string-name>
          ,
          <article-title>Automatic text summarization of biomedical text data: a systematic review</article-title>
          ,
          <source>Information</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>393</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mostafa</surname>
          </string-name>
          ,
          <article-title>A systematic review of automatic text summarization for biomedical literature and ehrs</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          <volume>28</volume>
          (
          <year>2021</year>
          )
          <fpage>2287</fpage>
          -
          <lpage>2297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Keszthelyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gaudet-Blavignac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bjelogrlic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lovis</surname>
          </string-name>
          , et al.,
          <article-title>Patient information summarization in clinical settings: scoping review</article-title>
          ,
          <source>JMIR Medical Informatics</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <article-title>e44639</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Biobert: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <year>2019</year>
          )
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          . URL: http://dx.doi.org/10.1093/bioinformatics/btz682. doi:10.1093/bioinformatics/btz682.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altosaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ranganath</surname>
          </string-name>
          ,
          <article-title>Clinicalbert: Modeling clinical notes and predicting hospital readmission</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/1904.05342. arXiv:1904.05342.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Biogpt: generative pre-trained transformer for biomedical text generation and mining</article-title>
          ,
          <source>Briefings in Bioinformatics</source>
          <volume>23</volume>
          (
          <year>2022</year>
          ). URL: http://dx.doi.org/10.1093/bib/bbac409. doi:10.1093/bib/bbac409.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rhazzafe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Carafini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Colreavy-Donnelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dhassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. S.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <article-title>Hybrid summarization of medical records for predicting length of stay in the intensive care unit</article-title>
          ,
          <source>Applied Sciences</source>
          (
          <year>2024</year>
          ). URL: https://www.mdpi.com/2076-3417/14/13/5809. doi:10.3390/app14135809.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Delbrouck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chambon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Langlotz</surname>
          </string-name>
          ,
          <article-title>Overview of the radsum23 shared task on multi-modal and multi-anatomical radiology report summarization</article-title>
          ,
          <source>in: The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>478</fpage>
          -
          <lpage>482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>DeYoung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Zuylen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kuehl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>MS^2: Multi-document summarization of medical studies</article-title>
          ,
          <source>in: EMNLP</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Soboczenski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Marshall</surname>
          </string-name>
          ,
          <article-title>Generating (factual?) narrative summaries of rcts: Experiments with neural multi-document summarization</article-title>
          ,
          <source>AMIA Annual Symposium</source>
          abs/2008.11293 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dligach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Churpek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Afshar</surname>
          </string-name>
          ,
          <article-title>Overview of the problem list summarization (probsum) 2023 shared task on summarizing patients' active diagnoses and problems from electronic health record progress notes</article-title>
          ,
          <source>in: Proceedings of the conference. Association for Computational Linguistics. Meeting</source>
          , volume 2023,
          <year>2023</year>
          , p.
          <fpage>461</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Pmc-patients: A large-scale dataset of patient summaries and relations for benchmarking retrieval-based clinical decision support systems</article-title>
          ,
          <source>arXiv preprint arXiv:2202.13876</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          , in: Text Summarization Branches Out,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <article-title>Meteor: an automatic metric for mt evaluation with high levels of correlation with human judgments</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Statistical Machine Translation</source>
          , StatMT '07,
          Association for Computational Linguistics, USA,
          <year>2007</year>
          , pp.
          <fpage>228</fpage>
          -
          <lpage>231</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>Bertscore: Evaluating text generation with bert</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Bartscore: Evaluating generated text as text generation</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2106.11520. arXiv:2106.11520.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <article-title>BLEURT: Learning robust metrics for text generation</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          , J. Tetreault (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics
          , Online,
          <year>2020</year>
          , pp.
          <fpage>7881</fpage>
          -
          <lpage>7892</lpage>
          . URL: https://aclanthology.org/2020.acl-main.704/. doi:10.18653/v1/2020.acl-main.704.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D.</given-names>
            <surname>Fraile Navarro</surname>
          </string-name>
          , E. Coiera,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hambly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Triplett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Asif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Susanto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lorenzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Berkovsky</surname>
          </string-name>
          ,
          <article-title>Expert evaluation of large language models for clinical dialogue summarization</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>15</volume>
          (
          <year>2025</year>
          ). doi:10.1038/s41598-024-84850-x.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>G.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Elhadad</surname>
          </string-name>
          ,
          <article-title>A meta-evaluation of faithfulness metrics for long-form hospital course summarization</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2303.03948. arXiv:2303.03948.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1907.11692. arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , CoRR abs/1810.04805 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nissen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wynn</surname>
          </string-name>
          ,
          <article-title>The clinical case report: a review of its merits and limitations</article-title>
          ,
          <source>BMC Research Notes</source>
          <volume>7</volume>
          (
          <year>2014</year>
          )
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Gagnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kienle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Riley</surname>
          </string-name>
          ,
          <article-title>The CARE guidelines: consensus-based clinical case reporting guideline development</article-title>
          ,
          <source>Global Advances in Health and Medicine</source>
          <volume>2</volume>
          (
          <year>2013</year>
          )
          <fpage>38</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kidd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hubbard</surname>
          </string-name>
          ,
          <article-title>Introducing Journal of Medical Case Reports</article-title>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gonzalez-Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pàmies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Llop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Baucells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Dalt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tamayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Saiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Espuña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Prats</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aula-Blasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rubio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shvets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sallés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Lacunza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Pikabea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Palomar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Falcão</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tormo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vasquez-Reina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marimon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ruíz-Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Villegas</surname>
          </string-name>
          ,
          <article-title>Salamandra technical report</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.08489. arXiv:2502.08489.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Treviso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Guerreiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zerva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Farinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Maroti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G. C.</given-names>
            <surname>de Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Glushkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Alves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Coheur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F. T.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <article-title>CometKiwi: IST-Unbabel 2022 submission for the quality estimation shared task</article-title>
          , in:
          <string-name><given-names>P.</given-names> <surname>Koehn</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Barrault</surname></string-name>,
          <string-name><given-names>O.</given-names> <surname>Bojar</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Bougares</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Chatterjee</surname></string-name>,
          <string-name><given-names>M. R.</given-names> <surname>Costa-jussà</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Federmann</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Fishel</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Fraser</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Freitag</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Graham</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Grundkiewicz</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Guzman</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Haddow</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Huck</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Jimeno Yepes</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Kocmi</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Martins</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Morishita</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Monz</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Nagata</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Nakazawa</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Negri</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Névéol</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Neves</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Popel</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Turchi</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Zampieri</surname></string-name>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>