<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MaLei at MultiClinSUM: Summarisation of Clinical Documents using Perspective-Aware Iterative Self-Prompting with LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Libo Ren</string-name>
          <email>renlibo994@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yee Man Ng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lifeng Han</string-name>
          <email>l.han@lumc.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leiden Institute of Advanced Computer Science (LIACS), Leiden University</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leiden University Medical Center</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Manchester</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Efficient communication between patients and clinicians plays an important role in shared decision-making. However, clinical reports are often lengthy and filled with clinical jargon, making it difficult for domain experts to identify important aspects in the document efficiently. This paper presents the methodology we applied in the MultiClinSUM shared task for summarising clinical case documents. We used an Iterative Self-Prompting (ISP) technique on large language models (LLMs) by asking LLMs to generate task-specific prompts and refine them via example-based few-shot learning. Furthermore, we used lexical and embedding-space metrics, ROUGE and BERTScore, to guide the iterative prompt refinement across epochs. Our submission using perspective-aware ISP on GPT-4 and GPT-4o achieved ROUGE scores (46.53, 24.68, 30.77) and BERTScores (87.84, 83.25, 85.46) for (P, R, F1) from the official evaluation on 3,396 clinical case reports from various specialties extracted from open journals. The high BERTScore indicates that the model produced summaries semantically equivalent to the references, even though the overlap at the exact lexical level is lower, as reflected in the lower ROUGE scores. This work sheds some light on how perspective-aware ISP (PA-ISP) can be deployed for clinical report summarisation and support better communication between patients and clinicians.</p>
      </abstract>
      <kwd-group>
        <kwd>Shared Decision Making</kwd>
        <kwd>Health Literacy</kwd>
        <kwd>Patient Communication</kwd>
        <kwd>LLMs</kwd>
        <kwd>Clinical Summarisation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Efficient and effective communication between patients and healthcare professionals plays an important
role in patient care [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. However, healthcare providers frequently have to read many clinical documents
in a short time frame to understand their current patients. This is challenging, as patients’ clinical
documents contain rich information on problems, diagnoses, treatments, progression, and side
effects. Similarly, patients often do not have the clinical expertise to fully understand the lengthy clinical
documents about their health issues. A concise and accurate summary of clinical documents saves
healthcare professionals time in understanding the problem at hand, and helps patients
understand their health conditions better and earlier. We attended the multilingual clinical document
summarisation shared task (MultiClinSUM) to explore large language models (LLMs) for this
challenge.
      </p>
      <p>
        The MultiClinSUM Track is organised by the Barcelona Supercomputing Center’s NLP for Biomedical
Information Analysis group and promoted by Spanish and European projects such as DataTools4Heart,
AI4HF, and BARITONE [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>To leverage current state-of-the-art developments in natural language processing (NLP), we
investigated Iterative Self-Prompting (ISP) [4] with GPT-4 and GPT-4o for automatic summarisation of clinical
documents. In this methodology, we ask the LLM to generate prompts itself for approaching the
task, guided by detailed instructions and example-based learning; the automatically generated summary is returned
to the LLM with example references, asking it to refine the prompts for this specific task. The ISP
technique has proved very useful and efficient in leveraging LLMs to generate a clear summary that
includes patients’ symptoms, diagnoses, treatments, and outcomes/follow-ups, in a previous shared task
on healthcare answer summarization [4]. We use the datasets provided by the shared task organizers,
which contain clinical case reports and summaries written by healthcare providers in various
languages, including English, Spanish, French, and Portuguese. The English test set under investigation
contains 3,396 clinical documents. We perform both quantitative and qualitative evaluations, through
human judgement and automatic metrics, including ROUGE and BERTScore. ROUGE is a lexical overlap metric,
and BERTScore is an embedding-space semantic similarity metric.</p>
      <p>Quantitative evaluation shows that while our system output has a ROUGE F1 score of 0.31
against the reference on exact lexical matches, the semantic BERTScore reaches an F1 of 0.85, which
indicates high preservation of semantic meaning. This sheds light on the potential usage
of LLMs for the summarisation of clinical documents using the ISP technique. The qualitative analysis
confirms that the generated summaries tend to cover the key clinical aspects and contain logical
paraphrasing. We also carried out an error analysis to see in what ways LLMs produce undesirable results,
such as the considerable length of the summaries. Another interesting finding from this shared task is
that LLMs tend to generate longer text to comment on missing data when the clinical document is too
short, such as “The case report does not provide specific details on the outcome or follow-up. Typically,
such a patient would require close monitoring and treatment adjustments based on laboratory and
clinical responses.” This also provides some insight into the clarity of current clinical documents/reports.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Clinical NLP</title>
        <p>Clinical NLP has drawn attention from both NLP and healthcare researchers in recent years, owing to
the development and effectiveness of modern NLP models and the eagerness to test such AI models
in healthcare domains. For instance, the ClinicalNLP workshop series has run since 2016, reaching its
6th edition in 2024 [5]. A related venue is the international Biomedical NLP workshop series
(BioNLP), running since 2004 [6, 7].</p>
        <p>The corresponding tasks have included clinical text translation [8], biomedical abstract simplification
[9, 10], clinical event recognition [11], temporal relation extraction [12, 13], entity linking/normalisation
to SNOMED CT and British National Formulary (BNF) codes [14], synthetic data
generation/augmentation [15, 16], de-identification of patient-sensitive information [17], and healthcare answer summarisation
[4], etc. These works have explored methodologies from different paradigms, such as fine-tuning
encoder-based models, training encoder-decoder models, and prompting decoder-only models using
different techniques.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Healthcare Data Summarisation</title>
        <p>For the clinical document summarisation task, the most relevant work includes the shared task on
Perspective-Aware Healthcare Answer Summarisation (PerAnsSumm 2025) [18]. This shared task
covered the summarisation of online-forum healthcare answers while considering the different
perspectives, i.e., types of information such as ‘Cause’ or ‘Suggestion’, within an answer. In that shared task, we
used Iterative Self-Prompting (ISP) with Claude and o1 for perspective-aware healthcare answer
summarisation, as described in [4, 18]. Similar to clinical documents, online forum responses vary greatly
in length. The key difference is that clinical documents often contain clinician-specific abbreviations
and jargon, which pose challenges for NLP models to interpret. In contrast, online forum data typically
includes social media-style writing, such as frequent spelling mistakes and grammatical errors. In this
work, we build upon our experience from the PerAnsSumm shared task to design perspective-aware
summaries for clinical documents.</p>
        <p>[Figure 1: overview of the perspective-aware ISP framework. A meta-prompt combining the task
instruction, CoT, metric guidance, and the perspectives (Patient Presentation, Clinical Presentation,
Diagnosis, Treatment/Intervention, Outcome and Follow-up) drives prompt generation and iterative
refinement over input examples, system outputs, and reference summaries.]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Overview of the Prompting Framework</title>
        <p>There are mainly three clinical datasets involved in this project, as listed below:
• multiclinsum_gs_train_en: contains 592 gold-standard samples, which are manually annotated
and consist of a full-text and summary pair.
• multiclinsum_large-scale_train_en: contains 25,902 full-text and summary pairs. Their quality is
slightly lower than that of the 592 gold samples, but they are still useful for data augmentation.
• multiclinsum_test_en: the English test set, which includes 3,396 full-text cases but without any
summaries.</p>
        <p>We mainly adopt the Iterative Self-Prompting (ISP) strategy in this task. As shown in Figure 1, we
construct the meta-prompt based on the combination of Chain-of-Thought (CoT) instructions, clinical
perspectives, and metric-based guidance. The meta-prompt is provided to the LLM together with a
small number of few-shot examples at the beginning.</p>
        <p>Based on this meta-prompt and the examples, the LLM is instructed to generate a new task-specific
prompt that guides the clinical summarisation process more effectively. Using this synthetic prompt
(prompt_v1), we feed it together with clinical full texts from a portion of the gold training set into
the model to generate corresponding summaries. These synthetic summaries are then compared with
the ground-truth summaries from the gold data, and evaluation scores, as well as reflective feedback
and revision suggestions, are produced accordingly. This feedback serves as a reference for further prompt
refinement, allowing us to iteratively update the prompt to obtain prompt_v2, _v3, and so on.</p>
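        <p>For illustration, the following is a minimal Python sketch of this refinement loop. It is a simplification under assumptions: llm_call stands in for any chat-completion API, and the function names, batch handling, and stopping threshold (min_gain) are our own illustrative choices rather than the exact implementation.</p>
        <preformat>
# Minimal sketch of the ISP loop; llm_call() stands in for any
# chat-completion API, and names/thresholds are illustrative only.
def iterative_self_prompting(meta_prompt, train_pairs, llm_call,
                             rouge_l, max_epochs=5, min_gain=0.01):
    # Step 1: ask the LLM to write a task-specific prompt (prompt_v1).
    prompt = llm_call(meta_prompt)
    best = 0.0
    for epoch in range(max_epochs):
        # Step 2: summarise gold full texts with the current prompt.
        outputs = [llm_call(prompt + "\n\n" + text)
                   for text, _ in train_pairs]
        scores = [rouge_l(out, ref)
                  for out, (_, ref) in zip(outputs, train_pairs)]
        mean = sum(scores) / len(scores)
        # Step 3: stop when no obvious improvement is observed.
        if epoch > 0 and mean - best &lt; min_gain:
            break
        best = max(best, mean)
        # Step 4: feed scores and reflections back to get prompt_v2, _v3, ...
        prompt = llm_call(
            "Reflect on the weaknesses of this prompt given the scored "
            "outputs, then rewrite it.\n\nPROMPT:\n" + prompt +
            "\n\nROUGE-L SCORES:\n" + str(scores))
    return prompt
        </preformat>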
        <p>This prompt updating process is repeated until no obvious performance improvement is observed.
Once the improvement plateaued, we planned to augment the structure using additional spans extracted
from the remaining gold data and to apply a retrieval-based (RAG) technique using the 25,902
non-gold training samples. After all experiments, the best-performing prompt version is used to generate
the final clinical summaries on the test set.</p>
        <p>Unfortunately, due to time constraints, we were only able to complete the steps before the structure
augmentation phase (as detailed in Section 3.4). As a result, we selected the best-performing prompt at
that point and used it directly for inference on the test set. The remaining experimental designs can be
explored in future work.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Prompt Initialisation and Few-shot Setup</title>
        <p>As shown in Figure 1, we first construct an initial instruction that briefs the LLM on the summarisation
task, with chains of thought (CoTs) on how it should reason, for example:
• What common structure or patterns do you observe in the examples?
• What information is emphasised?
• How can a language model be guided to produce similar quality outputs?
• What errors should be avoided?
These CoTs are combined with perspective-based, i.e., multifaceted, structural guidance and metric-based
feedback to inform the LLM’s generation process.</p>
        <p>The Perspectives we designed include:
1. Patient Presentation: age, sex, relevant history.
2. Clinical Presentation: key symptoms and signs.
3. Diagnosis: relevant investigations, tests, conclusions.
4. Treatment/Intervention: medications, surgeries, therapies.
5. Outcome and Follow-up: results of treatment, current status.</p>
        <p>The metrics we used are ROUGE-L (for lexical overlap) and BERTScore (for semantic
fidelity in embedding space). The instructions, structural perspectives, evaluation metrics, and three representative examples
collectively form the meta-prompt, which is then used to generate the initial prompt. An example of
the meta-prompt is provided in Appendix B.1.</p>
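        <p>As an illustration of how these ingredients compose, the sketch below assembles a meta-prompt string from the CoT questions, the five perspectives, the metric guidance, and the few-shot examples. The literal wording here is illustrative; the actual meta-prompt is the one shown in Appendix B.1.</p>
        <preformat>
# Illustrative assembly of the meta-prompt from its four ingredients;
# the exact wording differs from the prompt shown in Appendix B.1.
COT_QUESTIONS = [
    "What common structure or patterns do you observe in the examples?",
    "What information is emphasised?",
    "How can a language model be guided to produce similar quality outputs?",
    "What errors should be avoided?",
]
PERSPECTIVES = [
    "Patient Presentation: age, sex, relevant history.",
    "Clinical Presentation: key symptoms and signs.",
    "Diagnosis: relevant investigations, tests, conclusions.",
    "Treatment/Intervention: medications, surgeries, therapies.",
    "Outcome and Follow-up: results of treatment, current status.",
]

def build_meta_prompt(examples):
    """Combine CoT questions, perspectives, metric guidance and examples."""
    parts = ["You will design a prompt for clinical case summarisation.",
             "Think step by step:"]
    parts += ["- " + q for q in COT_QUESTIONS]
    parts.append("Each summary should cover the following perspectives:")
    parts += [f"{i}. {p}" for i, p in enumerate(PERSPECTIVES, 1)]
    parts.append("Outputs are scored with ROUGE-L (lexical overlap) and "
                 "BERTScore (semantic fidelity); optimise for both.")
    for text, summary in examples:  # three gold (full-text, summary) pairs
        parts.append("EXAMPLE INPUT:\n" + text +
                     "\nEXAMPLE SUMMARY:\n" + summary)
    return "\n".join(parts)
        </preformat>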
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Prompt Iterative Update and Refinement</title>
        <p>We selected a small batch of 50 full-text and summary pairs to drive the iterative prompt refinement. Note
that data points indexed from 4 to 53 were used at this stage, as the first three samples had already been
used for the initial prompt generation as few-shot examples.</p>
        <p>In each epoch, we not only generated summaries based on the full texts using the initial prompt
but also compared the synthetic summaries with the gold-standard annotations, asking GPT-4o to
provide reflections and revision suggestions. For each generated summary, we computed both
ROUGE-L and BERTScore, and asked the model to reflect on the reasons behind its
performance. We found that BERTScore remained relatively stable (consistently above 0.85), while
ROUGE-L scores fluctuated significantly, ranging from 0.12 to 0.52. Therefore, our optimisation efforts
focused on improving ROUGE-L.</p>
        <p>To iteratively update the prompt while balancing performance and computational cost, we selected
a small subset of 15 summaries with the lowest ROUGE-L scores and included their corresponding
evaluation feedback. These were used as new few-shot examples to guide the prompt refinement.
The LLM was then instructed to revise the prompt by integrating the previous version along with the
reflection and suggestion content of these samples. The prompt used in this process is shown in the
Appendix B.2 Figure 9.</p>
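        <p>A sketch of this scoring and hard-example selection step is shown below, using the rouge-score and bert-score Python packages as one possible implementation; the package choice and variable names are assumptions for illustration.</p>
        <preformat>
# Sketch of per-epoch scoring and hard-example selection; the rouge-score
# and bert-score packages are one possible implementation choice.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def score_and_select(outputs, references, k=15):
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_f1 = [scorer.score(ref, out)["rougeL"].fmeasure
                for out, ref in zip(outputs, references)]
    # BERTScore is computed batch-wise over all candidate/reference pairs.
    _, _, bert_f1 = bert_score(outputs, references, lang="en")
    # Keep the k summaries with the lowest ROUGE-L as refinement examples.
    worst = sorted(range(len(rouge_f1)), key=lambda i: rouge_f1[i])[:k]
    return worst, rouge_f1, bert_f1.tolist()
        </preformat>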
        <p>We conducted five epochs of this process. For the initial version (prompt_v1), one summary was
found to be invalid for evaluation scoring, with an overall BERTScore of 0.86 and a ROUGE-L of 0.30
among the remaining 49 full-text and summary pairs. Interestingly, after the prompt updates, although
the evaluation scores did not improve significantly, the invalid case was resolved. In other words, the
BERTScore continued to fluctuate around 0.86, and the ROUGE-L around 0.30, across all 50 full-text
and summary pairs. As a result, we adopted prompt_v2 as the best-performing version, the first in
which invalid predictions were eliminated.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Full-data Structure-aware Prompt Enhancement</title>
        <p>To further improve structural consistency, the remaining 539 gold summaries were intended for
extracting common phrases and analysing section-wise linguistic patterns. For example, we aimed to
identify common clinical spans and examine the language style of the gold-standard summaries, e.g.,
phrases like “The patient presented with. . . ” or “Treatments include. . . ”. We also considered measuring
the average length of each paragraph and investigating whether consistent structural patterns could be
observed and used to refine the prompt.</p>
        <p>Based on these insights, the native instructions for the five clinical perspectives could be further
specified. In addition, regular expressions or phrase-matching methods could be employed to capture
high-frequency sentence structures or templates. Sentence patterns that are often overlooked could
also be statistically analysed. Combining these steps may contribute to improving evaluation scores,
particularly the ROUGE-L score.</p>
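        <p>As a hypothetical sketch of this planned phrase-mining step, one could count sentence-opening n-grams across the remaining gold summaries, as below; the n-gram length and cut-off are illustrative assumptions, since this step was not carried out.</p>
        <preformat>
# Hypothetical sketch of the planned phrase mining: count how often each
# sentence-opening n-gram occurs across the remaining gold summaries.
import re
from collections import Counter

def frequent_openings(summaries, n=4, top=20):
    counts = Counter()
    for summary in summaries:
        for sentence in re.split(r"(?&lt;=[.!?])\s+", summary):
            words = sentence.split()
            if len(words) >= n:
                counts[" ".join(words[:n]).lower()] += 1
    # Expected hits: e.g. "the patient presented with", "treatments include"
    return counts.most_common(top)
        </preformat>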
        <p>Despite several iterations of prompt revision based on reflective feedback, the model outputs did
not demonstrate any improvements in ROUGE-L, indicating that more sophisticated strategies beyond
self-iterative prompt refinement may be required. Owing to time constraints, further exploration in
this area is reserved for future work.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.5. Similar Case Retrieval-based Few-shot Augmentation</title>
        <p>In addition to the gold-standard set, the dataset also includes 51,804 extended clinical cases, each
consisting of a full-text input and its corresponding summary. To enhance test-time generation, we
retrieve cases whose input texts are semantically similar to the current full-text input using sentence
embeddings. SentenceBERT, combined with cosine similarity, is applied at this stage.</p>
        <p>The summaries from the top retrieved cases are inserted before the test input as few-shot
demonstrations, following the same format as the manually selected gold examples in Section 3.2. Typically, the top
3 cases are selected to balance diversity with prompt length constraints. The retrieved summaries are
not included in the evaluation and serve solely as auxiliary input for generation guidance.</p>
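        <p>The following sketch shows this retrieval step with the sentence-transformers library; the specific checkpoint name ("all-MiniLM-L6-v2") is an illustrative assumption rather than our confirmed configuration.</p>
        <preformat>
# Sketch of SentenceBERT retrieval with cosine similarity; the model
# checkpoint named here is an illustrative choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_few_shot(test_text, corpus_texts, corpus_summaries, top_k=3):
    corpus_emb = model.encode(corpus_texts, convert_to_tensor=True)
    query_emb = model.encode(test_text, convert_to_tensor=True)
    # Cosine-similarity search for the most similar full-text inputs.
    hits = util.semantic_search(query_emb, corpus_emb, top_k=top_k)[0]
    # The summaries of the top-k cases become few-shot demonstrations.
    return [corpus_summaries[hit["corpus_id"]] for hit in hits]
        </preformat>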
      </sec>
      <sec id="sec-3-5">
        <title>3.6. Testset Inference</title>
        <p>Once the optimal prompt was selected, the final version was used for test set summarisation. The
generation process follows the same setup described in Section 3.2. The main difference is that we
expect the generated summary to be shorter than the original full text. Therefore, we compared the
character lengths of the synthetic summaries and their corresponding full texts and identified cases
where the summary was unexpectedly longer.</p>
        <p>For these cases, we asked the language model to regenerate the summary up to five times. Some
outputs were successfully shortened, while others remained longer than the input. In such cases, we
directly replaced the generated summary with the original full text, assuming that the original text was
already sufficiently concise.</p>
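        <p>A minimal sketch of this length-control step follows; llm_call again stands in for the chat API, and the character-length comparison mirrors the check described above.</p>
        <preformat>
# Sketch of the length-control step at inference time; max_tries mirrors
# the up-to-five regeneration attempts described in the text.
def summarise_with_length_check(prompt, full_text, llm_call, max_tries=5):
    summary = llm_call(prompt + "\n\n" + full_text)
    tries = 0
    while len(summary) >= len(full_text) and tries &lt; max_tries:
        summary = llm_call(prompt + "\nBe more concise.\n\n" + full_text)
        tries += 1
    # Fall back to the original text if it is already sufficiently concise.
    return summary if len(summary) &lt; len(full_text) else full_text
        </preformat>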
        <p>It should be noted that although the full experimental pipeline was initially designed, we did not
proceed with the Full-data Structure-aware Prompt Enhancement and Similar Case Retrieval-based
Few-shot Augmentation (as detailed in Section 3.4 and Section 3.5) due to time constraints. These steps
are planned for future work to further explore their potential for improving performance.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Work and Submission to MultiClinSUM</title>
      <sec id="sec-4-1">
        <title>4.1. Development of the ISP-GPT-4/o model</title>
        <p>For LLMs, we used GPT-4 to generate the initial prompt based on a meta-prompt, and GPT-4o for
summarisation and reflection generation. We split the official MultiClinSUM data into several parts
for training around 5 epochs, and selected a well-performing prompt for final test set inference, as
described in Section 3. Figure 2 illustrates the data partitioning strategy and its corresponding usage
in our pipeline. We completed summary generation for 3,396 English full-text cases. Other languages
(Spanish, Portuguese, and French) will be explored in future work.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Submission outcome of MaLei from the shared task</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Quantitative Results</title>
          <p>
            For the MultiClinSUM 2025 shared task we attended [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ,
            <xref ref-type="bibr" rid="ref19">19</xref>
            ], the results of the 3,396 submitted English test
summaries are shown in Table 1, and their corresponding Grouped Bar Chart and Overlaid Histogram
are shown in Figure 3 and Figure 4.
          </p>
          <p>At the test set level, as shown in Table 1 and Figure 3, BERTScore is overall more than twice as high as
ROUGE-L, reflecting a similar trend observed in the training set. This suggests that our system achieves
strong semantic preservation while tending to paraphrase the original full text using different linguistic
styles. Another notable pattern further supports this. Across both metrics, precision consistently
exceeds recall. This indicates that the synthetic summaries are generally accurate in terms of what they
include, but may lack completeness at a finer-grained level. We speculate that this may be because
GPT-4o tends to generate more concise or compressed text. Additionally, the ROUGE-L recall is particularly
low (falling below 0.25), which implies that the model often uses more varied expressions instead of
preserving the original key phrases, leading to reduced lexical overlap. Therefore, future work could
focus on identifying and preserving fixed phrases and structural patterns in the summary generation
process. In summary, this result aligns with what has been observed in autoregressive models: they
tend to focus more on what to generate rather than on precise word-by-word reproduction.</p>
          <p>At the instance level, as shown in Figure 4, nearly all samples (3374 out of 3396, ≈ 99.35%) have
BERTScores concentrated in the 0.8–0.9 range, indicating a high degree of semantic consistency. In
contrast, most ROUGE-L scores fall between 0.2 and 0.4. The distribution exhibits a clear left-skew and
a long-tail pattern, suggesting that ISP may have inaccurately generalized the structure of the examples
provided to the model, thereby misleading the LLM’s generation. Notably, 380 samples (11.19%) received
ROUGE-L scores below 0.2, while none scored particularly low on BERTScore. This indicates that a
subset of summaries may suffer from issues such as missing critical information, disorganized structure,
or fragmented language.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Qualitative Results</title>
          <p>Generally, the summaries tend to cover key features in the clinical report and contain logical
paraphrasing, which likely led to the high BERTscores, i.e., a high semantic overlap. Figure 5 shows how the
generated summary covers key aspects about the patient, such as the age (82), gender (male), and the
travel history that is relevant for his symptoms. The summary then describes the symptoms (jaundice,
etc.), and the tests and results that have led to a particular diagnosis, followed by the treatment and
outcome. Based on a qualitative analysis with a small sample, we report that most generated summaries
appear to be well-structured according to key features present in human-written clinical reports and
consistent with the original report, which confirms the high BERTscores.</p>
          <p>Comparing the generated summaries during the prompt updating phase and the reference summaries
in the training set, we observe that the generated summaries tend to be longer than the gold-standard
summaries. The generated summaries tend to be more detailed, including details about the specific
tests done and the outcomes of these to reach a particular diagnosis. Figure 6 showcases an example
in which the reference summary is much shorter than the LLM-generated summary. The generated
summary contains more details regarding the different treatments that were done previously, while the
reference summary focuses on the main complaint of the patient (“focal hypertrichosis of white hair")
and the treatment that contributed to the patient’s improvement (“discontinued tacrolimus use"). This
suggests that the LLM struggles with discerning key events from details that might be redundant for
domain experts, who may be able to infer what procedures were done from a short and dense summary.
In addition, the generated summaries always introduce the full form of abbreviations, which is not
always the case in gold-standard summaries. These differences might have led to a lower ROUGE-L
score.</p>
          <p>Furthermore, we found that our prompt design resulted in the LLM including section headers with
every generated summary (see Figure 6 and Figure 7). This led to the generated summaries strictly
adhering to the structure as provided in the prompt, e.g., “Patient presentation”, “Diagnosis”, and
“Treatment”. This explicit structure was generally absent in gold-standard summaries in the training set.
Moreover, the strict adherence to the provided structure might have led to an ordering of key events
and entities that is different from the reference summary. While the structure is logical and follows key
features of clinical reports and summaries, it might have negatively impacted the structural overlap
between the generated summary and the reference summary. Therefore, in this case, the inclusion of
the section headers and strict adherence to the structure provided in the prompt might have contributed
to the overall low ROUGE-L score.</p>
          <p>Surprisingly, some generated summaries (12 texts in the test set, out of 3,396) are longer than the
original texts (excluding the section headers). Examining the original clinical reports of this sample, it
appears that the original reports are already quite brief and information-dense. The original texts of
these generated summaries average around 135 words, which is much shorter than the average of 527
words in the entire test set. Close analysis of this sample reveals that the generated texts are mostly a
repetition of the original text rather than a summary of key aspects and events. Figure 7 depicts how
the generated summary copies phrases and sentences literally from the original report, only swapping
a few verbs with close synonyms, e.g., “revealed” instead of “showed”.</p>
          <p>The generated summaries also frequently include observations about missing information in the
report, such as “The case report does not provide specific details on the outcome or follow-up. Typically,
such a patient would require close monitoring and treatment adjustments based on laboratory and
clinical responses.” Figure 7 exemplifies this. This likely negatively impacted the automatic evaluation
scores, but might be useful for domain experts who, in this way, can gain insight into what key
information is missing in the original report.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Prompt-Driven Model Behavior</title>
        <p>With the iteratively updated prompt, the model exhibited the following behaviors:
• Limited ability to compress content, especially in the Treatment and Patient Presentation
sections. This may be due to the lack of content filtering—the model tends to treat all information
equally, failing to prioritise critical conditions and key treatments.
• Structure-guided prompting may induce hallucinations or additional content, particularly
due to the decoder-only architecture. For example, if the prompt asks the model to summarize
the Outcome, but the original text lacks such content, the model may fabricate information like
"regular follow-up was scheduled." It may also lead to the model filling the gaps with statements
such as "[Outcome and follow-up details are not provided in the original text."
• Limited structural flexibility. The model tends to follow the prompt-defined structure too
rigidly, often generating key sentences by copying large portions of the original text with only
minor adjustments. It is also prone to explicitly including section headings based on the focus
points specified in the prompt, which may negatively affect ROUGE-L performance, particularly
when the generated order differs from the gold reference.</p>
        <p>Future work should focus on instructing the model to avoid redundant restatement, introducing
counterfactual constraints to reduce hallucinated content, and developing more flexible structural control,
such as preserving original abbreviations instead of expanding them, and softening the enforcement of
fixed section headers.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Reflections on Evaluation Metrics</title>
        <p>Although the original evaluation metric, ROUGE-L, effectively captures lexical overlap between the
generated and gold summaries, it has notable limitations and may underestimate summary quality in
certain cases. This is primarily because ROUGE-L is highly sensitive to variations in sentence structure
and phrasing.</p>
        <p>Furthermore, the manually annotated gold summaries are often highly compressed, frequently using
abbreviations and omitting connective phrases. In contrast, the synthetic summaries tend to resemble
patient-facing clinical reports, featuring more complete and explicit expressions. As a result, the
two types of summaries may differ more in style than in substance. A low ROUGE-L score does not
necessarily indicate poor summary quality, as the generated version may convey equivalent medical
content in a different form.</p>
        <p>Future work could incorporate metrics that account for structural coverage or introduce clinically
grounded factual checks as complementary evaluation strategies.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>For this shared task on multilingual clinical document summarisation, we used perspective-aware
iterative self-prompting (ISP) on LLMs via GPT-4/4o, inspired by the work of [4]. During
model development, we designed the following perspectives for summarisation: Patient
Presentation, Clinical Presentation, Diagnosis, Treatment, and Outcome (Follow-up). In conclusion,
perspective-aware iterative self-prompting (PA-ISP) on LLMs can help summarise lengthy clinical
documents into short summaries while keeping the essence of the clinical knowledge, helping
clinicians understand patients’ healthcare history more efficiently and helping patients understand
their condition better. Future work will include lay/plain-language adaptation of the
summarisation so that patients with low health literacy can better understand the clinical records, thereby
improving communication between patients and healthcare providers for better shared decision
making. Local LLMs will be explored and trained for better privacy preservation. LLM explainability
and reasoning are also part of our ongoing work. In addition, we plan to consider other languages, such as the
Spanish data from the shared task, as well as comparing more diverse prompts.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We thank Ida Korfage, Associate Professor at Erasmus MC and co-PI from the 4D Picture Project, for
the valuable comments and revision on the Conference Abstract version of this paper. We thank the
anonymous reviewers for their feedback on improving our article.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4o for grammar and spelling checking.</p>
    </sec>
    <sec id="sec-9">
      <title>A. MaLei Team Online Resources</title>
      <p>The resources for the MaLei Team at the MultiClinSUM shared task 2025 will be available via
• GitHub https://github.com/Libo-Ren/MultiClinSum,
• our earlier PerAnsSumm page from the Manchester Bees https://github.com/pabloRom2004/-PerAnsSumm-2025.</p>
    </sec>
    <sec id="sec-10">
      <title>B. Prompt</title>
      <sec id="sec-10-1">
        <title>B.1. Start Prompt</title>
        <p>An example of a starting prompt using ISP is shown in Figure 8.</p>
      </sec>
      <sec id="sec-10-2">
        <title>B.2. Instruction for Prompt Refinement</title>
        <p>The meta-prompt used to instruct the LLM to revise the summary-generation prompt is shown in Figure 9.</p>
        <p>[Figure 8: An example of a starting prompt. Figure 9: Instruction for prompt refinement.]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Stiggelbout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Pieterse</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. C. De Haes</surname>
          </string-name>
          ,
          <article-title>Shared decision making: concepts, evidence, and practice</article-title>
          ,
          <source>Patient education and counseling 98</source>
          (
          <year>2015</year>
          )
          <fpage>1172</fpage>
          -
          <lpage>1179</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Stiggelbout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Griffioen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brands</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rietjens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kunneman</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. Van Der Kolk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Van Eijck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Snelders</surname>
          </string-name>
          ,
          <article-title>Metro mapping: development of an innovative methodology to codesign care paths to support shared decision making in oncology</article-title>
          ,
          <source>BMJ evidence-based medicine 28</source>
          (
          <year>2023</year>
          )
          <fpage>291</fpage>
          -
          <lpage>294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <article-title>Overview of BioASQ 2025: The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , in: J. Carrillo-de Albornoz, J. Gonzalo, et al. (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] P. Romero, L. Ren, L. Han, G. Nenadic, The Manchester Bees at PerAnsSumm 2025: Iterative self-prompting with Claude and o1 for perspective-aware healthcare answer summarisation, in: Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health), 2025, pp. 340–348.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] T. Naumann, A. Ben Abacha, S. Bethard, K. Roberts, D. Bitterman (Eds.), Proceedings of the 6th Clinical Natural Language Processing Workshop, Association for Computational Linguistics, Mexico City, Mexico, 2024. URL: https://aclanthology.org/2024.clinicalnlp-1.0/.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] N. Collier, P. Ruch, A. Nazarenko (Eds.), Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP), COLING, Geneva, Switzerland, 2004. URL: https://aclanthology.org/W04-1200/.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] D. Demner-Fushman, S. Ananiadou, M. Miwa, K. Roberts, J. Tsujii (Eds.), Proceedings of the 23rd Workshop on Biomedical Natural Language Processing, Association for Computational Linguistics, Bangkok, Thailand, 2024. URL: https://aclanthology.org/2024.bionlp-1.0/.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] L. Han, S. Gladkoff, G. Erofeev, I. Sorokina, B. Galiano, G. Nenadic, Neural machine translation of clinical text: an empirical investigation into multilingual pre-trained language models and transfer-learning, Frontiers in Digital Health 6 (2024) 1211564.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Z. Ling, Z. Li, P. Romero, L. Han, G. Nenadic, MaLei at the PLABA track of TAC-2024: RoBERTa for Task 1, LLaMA3.1 and GPT-4o for Task 2, PLABA at TREC 2024 (2025).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Z. Li, S. Belkadi, N. Micheletti, L. Han, M. Shardlow, G. Nenadic, Investigating large language models and control mechanisms to improve text readability of biomedical abstracts, in: 2024 IEEE 12th International Conference on Healthcare Informatics (ICHI), IEEE, 2024, pp. 265–274.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] S. Belkadi, L. Han, Y. Wu, G. Nenadic, Exploring the value of pre-trained language models for clinical named entity recognition, in: 2023 IEEE International Conference on Big Data (BigData), IEEE, 2023, pp. 3660–3669.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] H. Tu, L. Han, G. Nenadic, Extraction of medication and temporal relation from clinical text using neural language models, in: 2023 IEEE International Conference on Big Data (BigData), IEEE, 2023, pp. 2735–2744.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Y. Cui, L. Han, G. Nenadic, MedTem2.0: Prompt-based temporal classification of treatment events from discharge summaries, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), 2023, pp. 160–183.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] P. Romero, L. Han, G. Nenadic, Medication extraction and entity linking using stacked and voted ensembles on LLMs, in: Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health), 2025, pp. 303–315.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] L. Ren, S. Belkadi, L. Han, W. Del-Pinto, G. Nenadic, Synthetic4Health: Generating annotated synthetic clinical letters, Frontiers in Digital Health 7 (2025) 1497130.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] L. Ren, S. Belkadi, L. Han, W. Del-Pinto, G. Nenadic, Beyond reconstruction: generating privacy-preserving clinical letters, in: Proceedings of the Sixth Workshop on Privacy in Natural Language Processing, 2025, pp. 60–74.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. Paul, D. Shaji, L. Han, W. Del-Pinto, G. Nenadic, DeidClinic: A multi-layered framework for de-identification of clinical free-text data, arXiv preprint arXiv:2410.01648 (2024).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] S. Agarwal, M. S. Akhtar, S. Yadav, Overview of the PerAnsSumm 2025 shared task on perspective-aware healthcare answer summarization, in: S. Ananiadou, D. Demner-Fushman, D. Gupta, P. Thompson (Eds.), Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health), Association for Computational Linguistics, Albuquerque, New Mexico, 2025, pp. 445–455. URL: https://aclanthology.org/2025.cl4health-1.41/. doi:10.18653/v1/2025.cl4health-1.41.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] M. Rodríguez-Ortega, E. Rodríguez-Lopez, S. Lima-López, C. Escolano, M. Melero, L. Pratesi, L. Vigil-Gimenez, L. Fernandez, E. Farré-Maduell, M. Krallinger, Overview of MultiClinSum task at BioASQ 2025: evaluation of clinical case summarization strategies for multiple languages: data, evaluation, resources and results, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), CLEF 2025 Working Notes, 2025.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>