<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing Multilingual Medical Summarization via Contextual Keyword Augmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Md Shahidul Salim</string-name>
          <email>MdShahidul_Salim@uml.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lianne Fu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arav Adikesh Ramakrishnan</string-name>
          <email>aravadikeshr@umass.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sunjae Kwon</string-name>
          <email>sunjaekwon@umass.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zonghai Yao</string-name>
          <email>zonghaiyao@umass.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hong Yu</string-name>
          <email>Hong_Yu@uml.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Healthcare Organization and Implementation Research, VA Bedford Health Care</institution>
          ,
          <addr-line>MA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Medicine, University of Massachusetts Medical School</institution>
          ,
          <addr-line>Worcester, MA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Manning College of Information and Computer Sciences, University of Massachusetts Amherst</institution>
          ,
          <addr-line>MA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Miner School of Computer and Information Sciences, University of Massachusetts Lowell</institution>
          ,
          <addr-line>MA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the work of the UMass BioNLP team for the MultiClinSUM multilingual medical text summarization task. We introduce MedCOD, a novel framework that improves multilingual summarization through keyword-based contextual augmentation. MedCOD begins by extracting medical keywords from full clinical texts using the Qwen2.5-14B model. These keywords are translated into five languages (English, Spanish, French, German, and Portuguese) using the NLLB 3.3B model and validated through back-translation and semantic equivalence checking with Qwen2.5-14B. The resulting multilingual keyword chains are incorporated into prompts as structured context. We evaluate MedCOD using two open-source large language models, Qwen2.5-14B and Phi-4 (14B), in both zero-shot and fine-tuned settings. We fine-tune the models using parameter-efficient LoRA on the MultiClinSUM training set. Experimental results demonstrate that MedCOD significantly improves summarization quality, especially in non-English languages. Ablation studies show that both prompt-level augmentation and fine-tuning contribute to the observed performance gains.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical Text Summarization</kwd>
        <kwd>Multilingual NLP</kwd>
        <kwd>Contextual Augmentation</kwd>
        <kwd>Chain of Dictionary</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Electronic Health Records (EHRs) have become an integral part of modern healthcare, serving as a
critical medium for enhancing patient engagement and facilitating better communication between
healthcare providers and patients. Recognizing the value of EHRs, the Centers for Medicare &amp; Medicaid
Services (CMS) Incentive Programs have promoted the meaningful use of EHRs, empowering patients to
access and manage their health information electronically. However, the full benefits of EHR accessibility
are not uniformly realized, particularly among patients with limited English proficiency [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In the
United States, approximately 17% of the population identifies as Hispanic, with nearly half possessing
limited English skills. This language barrier significantly impedes their ability to comprehend health
records, potentially leading to misunderstandings, reduced medication adherence, and poorer health
outcomes.
      </p>
      <p>
        In the domain of medical information processing, a major challenge lies in developing robust text
summarization systems capable of effectively condensing complex medical texts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The MultiClinSUM
Shared Task provides a valuable platform for addressing this issue by encouraging the development
of intelligent multilingual summarization systems. This paper focuses on the MultiClinSUM text
summarization task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], where we introduce a novel approach leveraging a medical chain of dictionary
to enrich the input with augmented contextual information. This dictionary is constructed by extracting
medical keywords from full texts and translating them into multiple languages. The enriched context is
then used to enhance the performance of open-source language models in multilingual summarization.
While medical text summarization in English is already well-established, our work emphasizes improving
summarization quality in other languages, where current models often underperform.
      </p>
      <p>
        In this study, we have applied our framework named MedCOD, which builds upon Chain-of-Dictionary
Prompting (COD) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and MedCOD [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]—recent approaches that enhance machine translation by
incorporating multilingual dictionaries into large language model (LLM) prompts. COD has demonstrated
strong performance on general-domain translation tasks (e.g., FLORES-200), improving translation
quality for many language pairs and even outperforming strong baselines like NLLB 3.3B in certain
settings. MedCOD extends this approach to English-to-Spanish biomedical translation by integrating various
prompting methods that use the large language model as a knowledge base (LLM-KB).
      </p>
      <p>
        In our framework, we first employed the Qwen2.5-14B [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] model to extract medical keywords from
the full text. These extracted keywords were then translated into multiple target languages using the
NLLB 3.3B [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] model. To ensure semantic consistency and translation quality, we used the Qwen2.5-14B
model to perform equivalence checks between the translated and back-translated keywords. Finally,
we evaluated the effectiveness of the overall framework using two open-source large language models
(LLMs): Qwen2.5-14B and Phi-4 14B.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Given that this study encompasses both pipeline development and prompt engineering, we organize
the related work into four subsections: Section 2.1, LLMs in Medical Text Summarization; Section 2.2,
Quantization and Parameter-Efficient Fine-Tuning (PEFT); Section 2.3, Prompt Engineering; and Section
2.4, COD and MedCOD in Machine Translation. We begin by reviewing prior work relevant to the core
pipeline, including medical text summarization with LLMs, model efficiency techniques, and advances
in prompt design strategies.</p>
      <sec id="sec-2-1">
        <title>2.1. LLMs in Medical Text Summarization</title>
        <p>
          The widespread implementation of EHRs has significantly increased the clinical documentation burden,
contributing to rising stress and clinician burnout [
          <xref ref-type="bibr" rid="ref2 ref8 ref9">2, 8, 9, 10</xref>
          ]. Analyzing vast textual data and
summarizing key information from EHRs imposes a substantial strain on clinician time. With the advent of
large language models (LLMs) like ChatGPT [11], the ability to generate coherent and clinically relevant
summaries has improved considerably [12, 13]. While various benchmarks exist for evaluating LLMs
on general NLP tasks [14, 15], many fail to capture the nuances of clinical reasoning, terminology, and
context. This motivates the development of medically grounded summarization pipelines tailored to
real-world use.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Quantization and Parameter-Efficient Fine-Tuning (PEFT)</title>
        <p>Quantization [16] is a model compression technique that reduces inference cost by replacing
high-precision floating-point weights (e.g., float32) with lower-precision formats such as int8 or bfloat16.
This significantly lowers memory usage and computation time, making LLMs deployable on
resource-constrained devices. However, quantization may lead to a drop in accuracy, especially for tasks requiring
domain-specific reasoning or subtle semantic understanding, both common in medical applications.</p>
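        <p>To make the trade-off concrete, the following is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustration only: the function names and the toy weight values are assumptions, and it is not the quantization scheme of any particular library or of the models used in this paper.</p>
        <preformat><![CDATA[
```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


# Toy weights (illustrative values, not taken from any model).
weights = [0.50, -1.27, 0.003, 1.00]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
```
]]></preformat>
        <p>In this toy example the small weight 0.003 is rounded to zero, which illustrates how low-magnitude values can be lost; the rounding error of any single weight is bounded by half the scale.</p>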
        <p>Parameter-Efficient Fine-Tuning (PEFT) [ ? ] complements quantization by updating only a small
subset of newly added parameters (e.g., LoRA adapters or prefix vectors) while keeping the core model
frozen. This approach improves fine-tuning efficiency in both computation and storage, especially for
domain adaptation, as it avoids updating billions of parameters. Unlike traditional
full-model fine-tuning, PEFT enables rapid iteration across downstream tasks and user-specific domains
while maintaining generalization.</p>
        <p>Although typically studied in isolation, quantization and PEFT are increasingly being used together to
build low-resource yet performant LLM-based applications. Recent work demonstrates that when paired
carefully, these methods can preserve task-relevant accuracy while reducing both training and inference
costs—an important consideration for medical NLP systems deployed in production environments.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Prompt Engineering</title>
        <p>Prompt engineering is the practice of designing input prompts to guide LLM behavior more effectively.
It has become essential in aligning model outputs with task requirements, particularly in zero- and
few-shot settings. In clinical NLP, carefully crafted prompts can greatly improve performance on tasks
like summarization, question answering, and reasoning.</p>
        <p>Among prompt engineering methods, Chain-of-Thought (CoT) prompting [11] stands out for its
ability to elicit step-by-step reasoning from LLMs. CoT is particularly useful in medical multiple-choice questions (MCQs), where
intermediate reasoning steps reflect critical thinking paths. However, studies have shown that different
CoT prompts may lead to divergent answers due to variability in reasoning trajectories. To mitigate
this, self-consistency [17, 18] aggregates multiple CoT responses and selects the most frequent answer,
improving robustness and accuracy.</p>
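        <p>The aggregation step of self-consistency can be sketched as a majority vote over the final answers of several sampled CoT responses. The snippet below is a minimal illustration; the function name and the sampled answers are hypothetical, and answer extraction from the model output is assumed to have happened already.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def self_consistent_answer(answers):
    """Return the most frequent final answer and its share of the votes."""
    # Normalize case and whitespace so "b " and "B" count as the same option.
    counts = Counter(a.strip().upper() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


# Hypothetical final answers extracted from five sampled CoT responses.
sampled = ["B", "b ", "C", "B", "A"]
answer, agreement = self_consistent_answer(sampled)
```
]]></preformat>
        <p>The agreement ratio can serve as a rough confidence signal. Ties are broken by first occurrence here, which is a simplification.</p>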
        <p>Recent work also explores prompt personalization (e.g., persona-based prompting), grounding (e.g.,
incorporating external knowledge), and prompt augmentation (e.g., using keywords or dictionary entries)
as techniques to improve LLM alignment with task-specific needs. However, prompt engineering still
lacks systematic understanding, especially in specialized domains like healthcare, where task ambiguity
and terminology complexity pose unique challenges.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. COD and MedCOD in Machine Translation</title>
        <p>
          Recent advances in multilingual machine translation have significantly enhanced LLM capabilities in
low-resource settings. The No Language Left Behind project [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] introduced a highly scalable model
covering over 200 languages, achieving state-of-the-art results on FLORES-200. Similarly, mBART [19]
employed denoising autoencoding for multilingual sequence-to-sequence tasks.
        </p>
        <p>
          Prompt-based approaches have proven particularly effective for multilingual translation.
Chain-of-Dictionary Prompting (COD) [20] augments prompts with multilingual dictionaries to guide translation
generation. MedCOD [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] extends this to the medical domain by incorporating domain-specific
dictionaries and keywords into the prompt, thereby improving terminology accuracy and factual alignment.
        </p>
        <p>Medical machine translation (Med-MT) is uniquely challenging due to the need for precise
terminology, context preservation, and limited training data. Early work by Liu and Cai [21] identified
common semantic drifts and domain-specific errors when translating EHRs. More recent efforts such
as BiomedBench [22] and MeLoT [23] address these limitations by providing multilingual biomedical
corpora for systematic evaluation.</p>
        <p>Our work extends MedCOD beyond sentence-level translation by integrating it into summarization
prompts, effectively unifying translation and summarization via contextual augmentation. This design
enables LLMs to generate multilingual medical summaries with improved fluency, fidelity, and task
alignment, particularly when paired with keyword filtering and open-source PEFT-tuned models.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We evaluated open-source LLMs for their effectiveness in medical text summarization across four
languages: English, Spanish, French, and Portuguese. Figure 1 provides an overview of our framework,
which incorporates the MultiClinSUM dataset, various LLMs, and a specific prompting methodology.
Our analysis focuses on the impact of MedCOD, our proposed framework designed to identify the most
suitable augmented knowledge (referred to as contextual information throughout this paper) to support
medical summarization. Furthermore, we explore the role of fine-tuning techniques in enhancing the
performance of these models.</p>
      <p>[Figure 1: Overview of the framework. Stages shown: MultiClinSUM dataset (3.4k testing and 100 ablation-study instances of full-text/summary pairs per language in English, Spanish, French, and Portuguese); extract and identify medical keywords; translate each keyword (en, es, de, fr, pt); evaluate keyword translation quality using back-translation and equivalence checking; create the medical chain of dictionary; create prompts for LLMs using structured context; test open- and closed-source models; incorporate prompt strategy (decision node "Context?"); fine-tune open-source models on 25k instances (decision node "Finetune"); evaluate with ROUGE-L-Sum and BERTScore; analyze results.]</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset Preparation and Keyword Extraction</title>
        <p>For dataset preparation, we utilized three subsets of the MultiClinSum dataset, which is designed to
support multilingual clinical summarization across four primary languages: English, Spanish, French,
and Portuguese. The first subset is the large-scale training set, comprising 25,902 full-text and summary
pairs per language. This dataset served as the foundation for training our
multilingual models. The second subset is the official test set used in the MultiClinSum Shared Task,
which includes 3,396 clinical case reports in English, 3,406 in Spanish, 3,469 in French, and 3,442 in
Portuguese. This standardized test set enabled consistent evaluation of model performance across
languages. Additionally, for our ablation study, we constructed a smaller subset drawn from the
gold-standard training data, which serves as our validation dataset and consists of 100 full-text and summary
pairs for each language. This subset includes data in English, Spanish, Portuguese, and German, and
was used to analyze the effect of language-specific context on model behavior. Following this, we
build a medical keyword dictionary: we extract keywords from each full-text document
in both the test and validation subsets using LLM-KB. These extracted keywords were later used to test
and validate the models. Figure 2 shows the dataset preparation and keyword extraction.</p>
        <table-wrap>
          <caption>
            <p>MultiClinSum Dataset Subsets Comparison</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Subset</th>
                <th>Size</th>
                <th>Languages</th>
                <th>Use</th>
                <th>Keyword Extraction</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Training Set</td>
                <td>25,902 pairs per language</td>
                <td>English, Spanish, French, Portuguese</td>
                <td>Training multilingual models</td>
                <td>No</td>
              </tr>
              <tr>
                <td>Test Set</td>
                <td>~3,400 reports per language</td>
                <td>English, Spanish, French, Portuguese</td>
                <td>Evaluating model performance</td>
                <td>Yes</td>
              </tr>
              <tr>
                <td>Validation Set</td>
                <td>100 pairs per language</td>
                <td>English, Spanish, Portuguese, German</td>
                <td>Analyzing model behavior</td>
                <td>Yes</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Multilingual Keyword Translation and Quality Evaluation</title>
        <p>For each medical concept in our test and validation sets, we translated the keywords into five
languages—English, German, Spanish, French, and Portuguese. To ensure translation quality, we applied
a back-translation technique: each translated keyword was back-translated, and LLM-KB was used
to verify semantic equivalence between the original and the back-translated version. This filtering
step ensured that only high-quality translations were retained, reducing potential confusion during
summarization. The validated translations were then compiled into a multilingual medical keyword
dictionary, which served as the foundation for generating structured prompts.</p>
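        <p>The back-translation filter described above can be sketched as follows. This is a minimal illustration under stated assumptions: <monospace>translate</monospace> and <monospace>is_equivalent</monospace> are hypothetical callables standing in for the NLLB translation model and the LLM-KB equivalence check, and the toy lookup table and string-equality check exist only to make the example runnable.</p>
        <preformat><![CDATA[
```python
def build_keyword_chain(keyword, source_lang, translate, is_equivalent, target_langs):
    """Keep a translation only if its back-translation matches the original keyword."""
    chain = {source_lang: keyword}
    for lang in target_langs:
        if lang == source_lang:
            continue
        forward = translate(keyword, source_lang, lang)
        back = translate(forward, lang, source_lang)
        if is_equivalent(keyword, back):
            chain[lang] = forward
    return chain


# Toy stand-ins (assumptions): a lookup table instead of a translation model,
# and case-insensitive string equality instead of an LLM equivalence judgment.
_TOY = {
    ("en", "es"): {"stroke": "ictus"},
    ("es", "en"): {"ictus": "stroke"},
    ("en", "fr"): {"stroke": "coup"},  # deliberately poor translation
    ("fr", "en"): {"coup": "blow"},
}


def toy_translate(text, src, tgt):
    return _TOY[(src, tgt)][text]


def toy_is_equivalent(original, back_translated):
    return original.strip().lower() == back_translated.strip().lower()


chain = build_keyword_chain(
    "stroke", "en", toy_translate, toy_is_equivalent, target_langs=["es", "fr"]
)
```
]]></preformat>
        <p>Here the Spanish entry survives the round trip, while the French entry is dropped because its back-translation no longer matches the original keyword. An LLM-based check would additionally accept paraphrases that plain string equality rejects.</p>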
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Medical Dictionary Construction and Prompt Engineering</title>
        <p>We experimented with various prompting strategies, incorporating information from LLM-KB to
assess which types of structured input contributed most to summarization performance. Ultimately,
we selected MedCOD as our final prompting method, as it consistently delivered the most accurate
results. These structured prompts enriched the input with contextual information, enabling the model
to better understand sentence meaning and structure, thereby improving the quality of multilingual
summarization. The prompts are presented in Table 1.</p>
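        <p>As a rough illustration of how validated keyword chains might be turned into structured context, the sketch below assembles a summarization prompt from a list of chains. The prompt wording, the "means" chaining format, and all function names are assumptions made for this example; they are not the prompts used in our experiments.</p>
        <preformat><![CDATA[
```python
LANG_NAMES = {
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "de": "German",
    "pt": "Portuguese",
}


def format_chain(chain):
    """Render one keyword chain, e.g. "stroke" (English) means "ictus" (Spanish)."""
    return " means ".join(
        f'"{term}" ({LANG_NAMES[lang]})' for lang, term in chain.items()
    )


def build_prompt(full_text, chains, summary_lang):
    """Prepend the multilingual keyword context to the summarization instruction."""
    context = "\n".join(f"- {format_chain(c)}" for c in chains)
    return (
        f"Summarize the following clinical case report in {LANG_NAMES[summary_lang]}.\n"
        f"Multilingual keyword context:\n{context}\n\n"
        f"Case report:\n{full_text}\n\nSummary:"
    )


# Hypothetical input: one validated chain and a placeholder case report.
chains = [{"en": "stroke", "es": "ictus"}]
prompt = build_prompt("A 67-year-old man presented with ...", chains, "es")
```
]]></preformat>
        <p>Keeping the chain formatting in its own function makes it straightforward to compare alternative context layouts without touching the rest of the prompt.</p>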
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Fine-tuning with Low-Rank Adaptation (LoRA) and Hyperparameter Tuning</title>
        <p>As illustrated in Figure 1, the MedCOD framework primarily functions as a prompting strategy that
supplies external knowledge to LLMs. Beyond prompting, we further enhance the performance of
open-source LLMs by fine-tuning them to better leverage the contextual information provided. To
achieve this, we use LoRA [24], a lightweight and efficient fine-tuning technique that significantly
reduces the number of trainable parameters. LoRA works by introducing a small set of trainable weights
into the model while keeping the original parameters frozen. This method not only accelerates training
and reduces memory usage but also produces compact model weight files—typically only a few hundred
megabytes—making them easier to store, distribute, and deploy.</p>
        <p>We train in an instruction-tuned format (SFT mode), where each training sample consists of structured dialogue
with distinct roles (e.g., system, user, assistant). The model is trained to predict only the assistant
response, with loss masking applied to exclude input tokens from gradient updates. This is suitable for
instruction-following tasks.</p>
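        <p>Loss masking of this kind is commonly implemented by setting the labels of all non-assistant tokens to an ignore value. The sketch below shows the idea in plain Python; the token IDs are made up, the function name is hypothetical, and the value -100 follows the default <monospace>ignore_index</monospace> of PyTorch's cross-entropy loss rather than anything specific to our training code.</p>
        <preformat><![CDATA[
```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


# Made-up token IDs: system and user turns form the prompt,
# the assistant turn is the response the model should learn to produce.
prompt_ids = [101, 7, 8, 9]
response_ids = [42, 43, 2]
input_ids, labels = build_labels(prompt_ids, response_ids)
```
]]></preformat>
        <p>Only the positions whose label differs from the ignore value contribute to the loss, so gradients come solely from the assistant response.</p>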
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analyses</title>
      <sec id="sec-4-1">
        <title>4.1. Experiment Settings</title>
        <p>4.1.1. Datasets</p>
        <p>
          In this study, we utilized the MultiClinSum dataset [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], which contains clinical case reports designed for
multilingual summarization tasks across four languages: English, Spanish, French, and Portuguese. The
dataset comprises 3,396 full-text clinical cases in English, 3,406 in Spanish, 3,469 in French, and 3,442 in
Portuguese. We also used the MultiClinSum large-scale training datasets for model training across four
languages. Each dataset contains 25,902 full-text and summary pairs in English, Spanish, French, and
Portuguese. This rich multilingual dataset enables effective training, fine-tuning, and comprehensive
evaluation of summarization models in the medical domain, supporting cross-lingual benchmarking
and demonstrating the generalizability of our proposed framework. For the ablation study, we selected
a small subset from the gold-standard training dataset in English, Spanish, German, and Portuguese,
with each language containing 592 full-text and summary pairs. This ablation test set was intentionally kept
small due to the high computational cost of applying our MedCOD framework to each instance using
LLM-KB, a process that is both resource-intensive and time-consuming.
        </p>
        <p>4.1.2. Models</p>
        <p>
          We evaluated a range of LLMs for multilingual medical summarization. Our experiments
included the open-source Phi-4 (14B) [25] and Qwen2.5-14B [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and the proprietary GPT-4o Mini [26] as baseline models. Additionally, we
employed NLLB-200 3.3B [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] as a translation model to support and enhance the prompting methods
used in our MedCOD framework.
        </p>
        <p>Phi-4 (14B). Developed by Microsoft, Phi-4 is a 14-billion-parameter LLM that emphasizes high data
quality through extensive use of synthetic data during training. Unlike earlier versions that relied
heavily on distillation from teacher models like GPT-4, Phi-4 surpasses its predecessor in STEM-related
question-answering tasks. This performance gain is primarily due to improved data generation and
post-training techniques, while its architecture remains largely consistent with Phi-3.</p>
        <p>Qwen2.5-14B. Qwen2.5 is an advanced LLM series released by Alibaba Cloud, significantly enhanced
through both pre-training and post-training phases. The pre-training data was expanded from 7 trillion
to 18 trillion tokens, resulting in notable improvements in common sense reasoning, domain-specific
knowledge, and general performance. Post-training involved over 1 million supervised samples and
multi-stage reinforcement learning. Qwen2.5 includes multiple model sizes, with the 14B and 7B
versions publicly available as open-weight models.</p>
        <p>GPT-4o Mini. GPT-4o Mini is a compact version of OpenAI’s GPT-4o, designed for efficient
deployment with minimal performance compromise. Released in July 2024, it supports both text and
image inputs and maintains strong performance across various benchmarks, including MMLU (82%),
MGSM (87.0%), and HumanEval (87.2%). With a 128K token context window and robust multilingual
capabilities, GPT-4o Mini is well-suited for lightweight, multimodal reasoning tasks.</p>
        <p>NLLB-200 3.3B. Developed by Meta AI, NLLB-200 3.3B is a multilingual translation model capable of
translating across 200 languages, including many low-resource languages. Leveraging a Sparsely Gated
Mixture of Experts architecture, it achieves a 44% improvement in BLEU score over previous models,
according to the FLORES-200 benchmark. NLLB-200 emphasizes both translation quality and safety,
playing a key role in enhancing cross-lingual understanding within our framework.</p>
        <p>4.1.3. Evaluation</p>
        <p>We evaluated the summarization performance using two main metrics: ROUGE-L-Sum [27] and
BERTScore [28].</p>
        <p>ROUGE-L-Sum measures the overlap between generated summaries and reference summaries,
focusing on the longest common subsequence to capture sentence-level similarity. It is particularly
suitable for extractive summarization tasks, where key sentences from the original text are selected and
combined. The calculation for ROUGE-L-Sum is built upon the standard ROUGE-L formulas for recall,
precision, and F1-score, which are applied to each pair of sentences. For a given reference sentence
<inline-formula><tex-math>$r$</tex-math></inline-formula> and a candidate sentence <inline-formula><tex-math>$c$</tex-math></inline-formula>, the formulas are:</p>
        <p>Recall (<inline-formula><tex-math>$R_{lcs}$</tex-math></inline-formula>). Measures what fraction of the reference sentence is captured in the candidate sentence:</p>
        <disp-formula><tex-math>\[ R_{lcs} = \frac{\mathrm{LCS}(r, c)}{\mathrm{length}(r)} \]</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>$\mathrm{LCS}(r, c)$</tex-math></inline-formula> is the length of the longest common subsequence of words, and <inline-formula><tex-math>$\mathrm{length}(r)$</tex-math></inline-formula> is the number
of words in the reference sentence.</p>
        <p>Precision (<inline-formula><tex-math>$P_{lcs}$</tex-math></inline-formula>). Measures what fraction of the candidate sentence is relevant compared to the
reference sentence:</p>
        <disp-formula><tex-math>\[ P_{lcs} = \frac{\mathrm{LCS}(r, c)}{\mathrm{length}(c)} \]</tex-math></disp-formula>
        <p>where <inline-formula><tex-math>$\mathrm{length}(c)$</tex-math></inline-formula> is the number of words in the candidate sentence.</p>
        <p>F1-Score (<inline-formula><tex-math>$F_{lcs}$</tex-math></inline-formula>). The harmonic mean of recall and precision, providing a single, balanced score:</p>
        <disp-formula><tex-math>\[ F_{lcs} = \frac{2 \cdot R_{lcs} \cdot P_{lcs}}{R_{lcs} + P_{lcs}} \]</tex-math></disp-formula>
        <p>BERTScore evaluates semantic similarity between the generated and reference summaries using
contextual embeddings from pre-trained BERT models. Unlike ROUGE, it captures meaning beyond
exact word matches. It reports three components: Precision (relevance of generated content), Recall
(coverage of reference content), and F1-Score (harmonic mean of precision and recall), providing a
balanced assessment of summary quality.</p>
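        <p>The calculation of BERTScore involves generating contextual embeddings for each token, computing
their similarity, and then aggregating these values into precision and recall scores.</p>
        <p>Let the reference sentence be a sequence of tokens <inline-formula><tex-math>$x = \langle x_1, x_2, \ldots, x_k \rangle$</tex-math></inline-formula> and the candidate sentence
be <inline-formula><tex-math>$\hat{x} = \langle \hat{x}_1, \hat{x}_2, \ldots, \hat{x}_m \rangle$</tex-math></inline-formula>.</p>
        <p>Contextual Embeddings. Both sentences are passed through a pre-trained BERT model to obtain a
sequence of contextual embedding vectors for each token. Let the embedding for token <inline-formula><tex-math>$x_i$</tex-math></inline-formula> be denoted
as <inline-formula><tex-math>$\mathbf{x}_i$</tex-math></inline-formula> and for <inline-formula><tex-math>$\hat{x}_j$</tex-math></inline-formula> as <inline-formula><tex-math>$\hat{\mathbf{x}}_j$</tex-math></inline-formula>.</p>
        <p>Similarity Matrix. The cosine similarity is calculated for every pair of tokens between the reference
and candidate sentences, creating a similarity matrix. The cosine similarity between two embedding
vectors <inline-formula><tex-math>$\mathbf{a}$</tex-math></inline-formula> and <inline-formula><tex-math>$\mathbf{b}$</tex-math></inline-formula> measures their alignment:</p>
        <disp-formula><tex-math>\[ \cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}^{\top} \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} \]</tex-math></disp-formula>
        <p>Using the similarity scores, BERTScore computes recall, precision, and an F1-score through a greedy
matching process.</p>
        <p>Recall (<inline-formula><tex-math>$R_{BERT}$</tex-math></inline-formula>). Recall measures how well the candidate sentence captures the content of the reference
sentence. For each token in the reference sentence, the metric finds the most similar token in the
candidate sentence based on their embedding similarity. The recall score is the average of these
maximum similarity scores.</p>
        <p>The equation for recall is:</p>
        <disp-formula><tex-math>\[ R_{BERT} = \frac{1}{k} \sum_{i=1}^{k} \max_{j=1,\ldots,m} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j \]</tex-math></disp-formula>
        <p>Assuming the embedding vectors are normalized, their dot product <inline-formula><tex-math>$\mathbf{x}_i^{\top} \hat{\mathbf{x}}_j$</tex-math></inline-formula> is equivalent to their cosine
similarity.</p>
        <p>Precision (<inline-formula><tex-math>$P_{BERT}$</tex-math></inline-formula>). Precision measures how relevant the tokens in the candidate sentence are with
respect to the reference sentence. For each token in the candidate sentence, it finds the most similar
token in the reference sentence. Precision is the average of these maximum similarity values.</p>
        <p>The equation for precision is:</p>
        <disp-formula><tex-math>\[ P_{BERT} = \frac{1}{m} \sum_{j=1}^{m} \max_{i=1,\ldots,k} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j \]</tex-math></disp-formula>
        <p>F1-Score (<inline-formula><tex-math>$F_{BERT}$</tex-math></inline-formula>). The F1-score is the harmonic mean of precision and recall, providing a single,
balanced metric that reflects both accuracy and completeness.</p>
        <p>The equation for the F1-score is:</p>
        <disp-formula><tex-math>\[ F_{BERT} = 2 \, \frac{P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}} \]</tex-math></disp-formula>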
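        <p>Both metrics can be illustrated with a small, self-contained sketch. The code below computes sentence-level ROUGE-L from the longest common subsequence and a BERTScore-style greedy matching over token vectors. It is a simplified illustration under explicit assumptions: whitespace tokenization replaces proper tokenization, the summary-level aggregation of ROUGE-L-Sum is not implemented, and the two-dimensional vectors are toy values rather than BERT embeddings. It is not the official implementation of either metric.</p>
        <preformat><![CDATA[
```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f1(recall, precision):
    """Harmonic mean of recall and precision (0.0 when both are zero)."""
    total = recall + precision
    return 0.0 if total == 0 else 2 * recall * precision / total


def rouge_l(reference, candidate):
    """Sentence-level ROUGE-L recall, precision, and F1 on whitespace tokens."""
    r, c = reference.split(), candidate.split()
    lcs = lcs_length(r, c)
    recall, precision = lcs / len(r), lcs / len(c)
    return recall, precision, f1(recall, precision)


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def greedy_bertscore(ref_vecs, cand_vecs):
    """Greedy-matching recall, precision, and F1 over unit-normalized vectors."""
    ref = [normalize(v) for v in ref_vecs]
    cand = [normalize(v) for v in cand_vecs]
    recall = sum(max(dot(x, y) for y in cand) for x in ref) / len(ref)
    precision = sum(max(dot(x, y) for x in ref) for y in cand) / len(cand)
    return recall, precision, f1(recall, precision)


# Toy inputs (assumptions): short sentences and 2-D stand-in "embeddings".
rl_recall, rl_precision, rl_f1 = rouge_l(
    "the patient was discharged home", "patient discharged home today"
)
bs_recall, bs_precision, bs_f1 = greedy_bertscore(
    [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 1.0]]
)
```
]]></preformat>
        <p>In the ROUGE-L example the longest common subsequence is "patient discharged home", so three of five reference words and three of four candidate words are matched.</p>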
      </sec>
      <sec id="sec-4-2">
        <title>4.2. MultiClinSUM Official Evaluations</title>
        <p>In this section, we present the official evaluation results of our system on the MultiClinSum dataset for
summarizing clinical case reports in English, Spanish, French, and Portuguese. We submitted only one
result after the deadline, which we include in Table 2.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ablation Study</title>
        <p>We conducted an ablation study using 100 gold-standard annotated instances in four languages:
English, Spanish, French, and Portuguese. The experiments were run on two open-source instruction-tuned
models (Qwen2.5 and Phi-4) and the proprietary GPT-4o-mini model, and evaluated using ROUGE-L and
BERTScore metrics (Precision, Recall, F1).</p>
        <p>4.3.1. Effect of MedCOD Context on Base LLM Performance</p>
        <p>To evaluate the utility of MedCOD as an external contextual augmentation method, we compare
the performance of base (non-finetuned) LLMs with and without contextual input. Across multiple
languages and models, we consistently observe that the inclusion of MedCOD context improves
summarization performance in terms of BERTScore and ROUGE-L. For instance, using Qwen2.5 in
French, the BERTScore_F increases from 0.6084 (no context) to 0.7383 (with MedCOD context), and
in Portuguese, from 0.6493 to 0.7577. Similarly, Phi-4 in Spanish improves from 0.6719 to 0.7632 with
context. These results highlight MedCOD’s effectiveness in guiding LLMs during summarization,
especially for languages where models may lack sufficient domain coverage or training representation.</p>
        <p>However, for English—the dominant language in most training corpora—LLMs already demonstrate
strong capabilities in medical summarization. This may stem from data contamination (no one knows
exactly which data has been used to train these models) or simply reflect the fact that current models
are already suficiently capable for such tasks in English. As a result, additional training or knowledge
augmentation (e.g., using MedCOD) ofers limited benefit. For example, in Phi-4, the diference in
BERTScore_F between using context (0.7704) and no context (0.7814) is marginal, suggesting saturation
in model performance.
4.3.2. Efect of MedCOD Context on finetuned LLM Performance
To further understand the impact of MedCOD beyond zero-shot settings, we assess its efect on finetuned
LLMs. We finetuned our LLMs on a 25K supervised summarization dataset provided by MultiClinSUM,
which covers diverse clinical narratives across multiple languages. Despite being trained on this
substantial dataset, the addition of MedCOD context during inference still leads to measurable improvements
or maintains strong performance across various languages. For instance, in Qwen2.5 (Spanish), the
BERTScore_F improves from 0.7825 (finetuned, no context) to 0.7903 (finetuned, with context). Similarly,
Phi-4 (French) shows BERTScore_F of 0.7784 with context and 0.7877 without, indicating competitive
performance. In Portuguese, Qwen2.5 improves from 0.7740 to 0.7755 when MedCOD is provided.
While these gains are smaller than those seen in base models, the results suggest that MedCOD can
act as a valuable auxiliary signal, especially in complex domains like medical summarization where
implicit knowledge and subtle cues are important. Even after extensive finetuning, contextual cues
from MedCOD help reinforce or clarify information, potentially addressing coverage gaps not captured
in the training data.</p>
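      <p>The gains quoted in Sections 4.3.1 and 4.3.2 can be tabulated directly. The following is a minimal Python sketch that recomputes absolute and relative BERTScore F changes from the values reported in the text. The dictionary layout and the helper name gain are illustrative assumptions, not part of the MedCOD pipeline.</p>
      <preformat><![CDATA[
```python
# BERTScore F values quoted in the text: (without MedCOD context, with MedCOD context).
# The grouping keys (setting, model, language) are an illustrative layout.
reported = {
    ("base", "Qwen2.5", "French"): (0.6084, 0.7383),
    ("base", "Qwen2.5", "Portuguese"): (0.6493, 0.7577),
    ("base", "Phi-4", "Spanish"): (0.6719, 0.7632),
    ("base", "Phi-4", "English"): (0.7814, 0.7704),
    ("finetuned", "Qwen2.5", "Spanish"): (0.7825, 0.7903),
    ("finetuned", "Phi-4", "French"): (0.7877, 0.7784),
    ("finetuned", "Qwen2.5", "Portuguese"): (0.7740, 0.7755),
}


def gain(before, after):
    """Return (absolute change, relative change in percent), rounded for display."""
    absolute = after - before
    return round(absolute, 4), round(100.0 * absolute / before, 2)


gains = {key: gain(*pair) for key, pair in reported.items()}

for (setting, model, language), (abs_gain, rel_gain) in gains.items():
    print(f"{setting:9s} {model:8s} {language:11s} {abs_gain:+.4f} ({rel_gain:+.2f}%)")
```
]]></preformat>
      <p>On these numbers, the base-model gains in French, Portuguese, and Spanish are roughly an order of magnitude larger than the finetuned gains, while Phi-4 shows a small decrease with context in English (base) and French (finetuned).</p>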
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In this study, we introduced MedCOD, a multilingual keyword-based contextual augmentation
framework designed to enhance medical text summarization in low-resource language settings. Through
extensive experiments on the MultiClinSUM dataset, we identified a number of key insights that
illuminate both the strengths and limitations of current open-source LLMs for multilingual medical
summarization.</p>
      <p>Performance Saturation in English. For English—the dominant language in most LLM pretraining
corpora—existing models already exhibit strong performance in medical summarization. As shown in
Table 3, the base Qwen2.5 model, without context or fine-tuning, achieves a BERTScore F of 0.7883, and
fine-tuning does not improve on it (0.7869). Similarly, Phi-4 achieves 0.7873 in the
finetuned setting without MedCOD context. These results suggest that English summarization tasks have
reached a saturation point for current LLMs. This strong performance may stem from data contamination (since
the actual pretraining corpus is unknown for these models) or simply from the inherent advantages of
English-centric training pipelines. As a result, additional knowledge augmentation, whether via MedCOD
or parameter-efficient fine-tuning, yields minimal or even negative gains.</p>
      <p>
        Challenges and Error Patterns in Non-English Settings. In contrast, performance in Spanish,
French, and Portuguese is consistently lower, especially in zero-shot scenarios. For example, the
Portuguese summarization task under the base Qwen2.5 setting (no fine-tuning, no context) yields
a BERTScore F of only 0.6493, while French yields 0.6084 in the same configuration. This aligns with
known disparities in language representation across pretraining corpora, where non-English
languages—particularly Portuguese and French—are significantly underrepresented [
        <xref ref-type="bibr" rid="ref7">7, 29, 30</xref>
        ].
Consequently, models often fail to follow instructions or generate outputs in the target language, instead
defaulting to English. We observed several failure cases in which the models produced fluent summaries
in English, despite being explicitly prompted in Spanish, French, or Portuguese, as shown in Table 4.
      </p>
      <p>These behavior patterns directly affect reference-based metrics like ROUGE and BERTScore, leading
to abnormally low scores due to mismatched output languages.</p>
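      <p>One way to quantify the language-mismatch failures described above is a language-adherence rate: the fraction of generated summaries written in the prompted language. The sketch below is a rough illustration only. The stopword lists, the function names detect_language and adherence_rate, and the stopword-overlap heuristic are all assumptions introduced here; a real evaluation would more likely rely on a dedicated language identifier.</p>
      <preformat><![CDATA[
```python
import re

# Small hand-picked stopword lists (assumption: illustrative only, not from the paper).
STOPWORDS = {
    "en": {"the", "and", "of", "with", "was", "to", "for", "is", "that", "were"},
    "es": {"el", "los", "y", "con", "del", "una", "fue", "por", "al", "las"},
    "fr": {"le", "les", "et", "avec", "du", "une", "est", "des", "dans", "été"},
    "pt": {"o", "os", "e", "com", "do", "uma", "foi", "em", "não", "ao"},
}


def detect_language(text):
    """Guess the language by counting stopword hits; return None if nothing matches."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # Unicode letters only
    counts = {
        lang: sum(1 for tok in tokens if tok in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def adherence_rate(summaries, target_language):
    """Fraction of summaries whose guessed language equals the target language."""
    if not summaries:
        return 0.0
    hits = sum(1 for s in summaries if detect_language(s) == target_language)
    return hits / len(summaries)


if __name__ == "__main__":
    outputs = [
        "El paciente fue tratado con antibióticos y dado de alta.",
        "The patient was treated with antibiotics and discharged.",
    ]
    print(adherence_rate(outputs, "es"))
```
]]></preformat>
      <p>Reporting such a rate alongside ROUGE-L and BERTScore would help separate low scores caused by wrong-language outputs from low scores caused by poor content.</p>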
      <p>Benefits of MedCOD and Fine-Tuning in Non-English Settings. Both MedCOD and fine-tuning
significantly enhance performance in non-English summarization tasks, primarily by improving
language adherence and fluency. For instance, Qwen2.5 in Portuguese improves from 0.6493 (no context, no
fine-tuning) to 0.7577 when only MedCOD is used. Likewise, Spanish performance with Phi-4 improves
from 0.6719 to 0.7632 in the same comparison. These gains are primarily due to better alignment with
the target language, as shown in Table 5.</p>
      <table-wrap>
        <table>
          <tbody>
            <tr>
              <td/>
              <td>Moderate; partially disjoint and overly technical list format</td>
              <td/>
            </tr>
            <tr>
              <td>Clinical Relevance</td>
              <td>Generic; lacks focus on critical intervention outcomes</td>
              <td>Specific; includes key events like anticonvulsant treatment and surgical outcome</td>
            </tr>
            <tr>
              <td>Language Alignment</td>
              <td>Acceptable Spanish but includes translation artifacts</td>
              <td>Native-like phrasing with proper medical expressions</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Fine-tuning provides stronger improvements by explicitly exposing the model to task-specific and
language-specific patterns. However, even without training, MedCOD achieves meaningful zero-shot
gains through prompt-level augmentation. This is especially important in settings where computational
resources are limited, making full fine-tuning impractical. MedCOD achieves these gains by inserting
target-language medical keywords as context, which serves as both a language anchor and a domain
signal. This aligns with observations in prompt-based adaptation literature [
        <xref ref-type="bibr" rid="ref4">31, 32, 4</xref>
        ].
      </p>
      <p>
        MedCOD Combined with Fine-Tuning Achieves Best Results. The combination of MedCOD
and fine-tuning yields the best performance across most non-English tasks. For example, in the Spanish
task, Qwen2.5 with both fine-tuning and MedCOD achieves a BERTScore F of 0.7903, outperforming all
other configurations. This supports our hypothesis that MedCOD provides complementary contextual
grounding that reinforces the representations learned during fine-tuning. As described in prior work on
prompting [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], providing high-salience, low-ambiguity input tokens can help direct LLM attention and
reduce reasoning drift. In our setting, MedCOD’s multilingual keyword chains offer just such a signal,
which is particularly effective when LLMs face under-represented linguistic domains, as shown in Table 6.
      </p>
      <p>Limitations and Future Work. Despite promising results, several limitations remain:
• Information Overload in Input. MedCOD expands the prompt with multiple language tags and
keyword chains, which may overwhelm the model. In some cases, we observed long, disorganized
outputs—likely caused by the model attending to less relevant context tokens. We refer to this as
“COD explosion.” Future work can explore input filtering strategies, such as perplexity-guided
token selection or keyword salience ranking, as discussed in [33, 34, 35].
• Minimal Gains in English. As noted earlier, injecting additional task-specific knowledge (via
MedCOD or LoRA) in English settings offers little benefit. Worse, overly long prompts may distract
from key content or introduce inconsistencies. This suggests a need for adaptive prompting
or context compression strategies. Additionally, test-time adaptation methods [36], such as
self-consistency [37], self-refinement [38], and test-time training [39], may yield more value in saturated
English domains.
• Unexplored Factors. Due to time and resource constraints, several areas remain
underinvestigated: (1) the discrepancy between ROUGE and BERTScore in some languages (e.g.,
Spanish vs. French), as shown in the table; (2) a comparison between monolingual and
multilingual fine-tuning setups; and (3) further analysis of whether MedCOD improves not just linguistic
consistency but also content structure or factuality.</p>
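      <p>The input-filtering idea raised in the first limitation can be sketched as follows. This is a hypothetical illustration, not the MedCOD implementation: the prompt template, the keyword-chain layout (source term mapped to per-language translations), the frequency-based salience proxy, and the names rank_keywords and build_prompt are all assumptions introduced here. The sketch keeps only the top-k keyword chains whose source term actually occurs in the note, which bounds how much context is added to the prompt.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter


def rank_keywords(source_text, keyword_chains, top_k=5):
    """Keep the top_k keyword chains, ranked by how often the term occurs in the note.

    keyword_chains maps a source term to {language_code: translation}
    (hypothetical layout). Terms that never occur in the note are dropped.
    """
    token_counts = Counter(re.findall(r"[^\W\d_]+", source_text.lower()))
    scored = []
    for term, translations in keyword_chains.items():
        # Salience proxy (assumption): summed frequency of the term's words.
        freq = sum(token_counts[word] for word in term.lower().split())
        if freq > 0:
            scored.append((freq, term, translations))
    # Highest frequency first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(term, translations) for _, term, translations in scored[:top_k]]


def build_prompt(source_text, keyword_chains, target_language, top_k=5):
    """Assemble a summarization prompt with a capped keyword-context block."""
    kept = rank_keywords(source_text, keyword_chains, top_k)
    lines = [
        f"- {term}: " + ", ".join(f"{lang}={word}" for lang, word in translations.items())
        for term, translations in kept
    ]
    context = "\n".join(lines)
    return (
        f"Summarize the clinical case below in {target_language}.\n"
        f"Keyword context:\n{context}\n\n"
        f"Case:\n{source_text}"
    )


if __name__ == "__main__":
    note = "Seizure recurred twice; seizure control improved after surgery. No fever."
    chains = {
        "seizure": {"es": "convulsión"},
        "surgery": {"es": "cirugía"},
        "stroke": {"es": "ictus"},
    }
    print(build_prompt(note, chains, "Spanish", top_k=2))
```
]]></preformat>
      <p>A perplexity-guided or learned salience score could replace the frequency proxy without changing the surrounding structure; only the scoring line inside rank_keywords would differ.</p>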
      <p>Looking ahead, we see significant potential for MedCOD to support a range of patient-centered
multilingual clinical NLP applications. Specifically, the ability to generate accurate and readable
summaries across multiple languages can benefit real-world scenarios such as: (1) Patient-facing summaries,
which simplify complex clinical language to improve health literacy; (2) Discharge summaries, which
offer clear and concise overviews of hospital visits to support care transitions; (3) Medical literature
summarization, which distills key findings and methodologies from multilingual scientific publications;
(4) Multilingual clinical communication, where summarization combined with translation facilitates
cross-lingual understanding of medical records; and (5) Telemedicine and remote consultations, where
concise summaries of patient data support efficient triage and diagnostic workflows. Indeed, our
motivation for participating in this shared task was to lay the groundwork for such patient-oriented
multilingual applications<sup>1</sup>. While the scope of this work was constrained by the competition’s structure,
we believe that the dataset provided by the organizers serves as a strong and practical foundation for
future research aimed at advancing these patient-centered multilingual application scenarios. Moreover,
our findings in this task align with a broader and well-documented challenge: many patient-centered
tasks suffer from imbalanced training data across languages, resulting in relatively strong performance
in English but substantial degradation in other widely spoken languages, such as Spanish, French, and
Portuguese. This disparity is particularly concerning given that patients who are not native English
speakers often have the greatest need for accessible, high-quality clinical NLP tools. Our future work
will therefore focus on extending our prior patient-centered BioNLP research [40, 41, 42, 43, 44]—largely
concentrated in English—to underrepresented languages by leveraging this resource, with the goal of
developing equitable, multilingual medical summarization systems that are not only clinically accurate
but also understandable and actionable for patients across diverse linguistic backgrounds.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work, we introduced MedCOD, a multilingual, keyword-based contextual augmentation
framework aimed at enhancing medical text summarization in low-resource language settings. Our
comprehensive experiments on the MultiClinSUM dataset, covering English, Spanish, French, and Portuguese,
demonstrate that MedCOD improves performance, particularly in non-English languages where baseline
models often struggle. While existing open-source LLMs, such as Qwen2.5 and Phi-4, already achieve
strong results in English—with minimal improvements from fine-tuning or context—our findings
reveal substantial gaps in Spanish, French, and Portuguese. These include frequent failures in language
adherence and factual completeness, which directly affect summarization quality and evaluation metrics.</p>
      <p><sup>1</sup> https://temu.bsc.es/multiclinsum/</p>
      <p>MedCOD addresses these challenges by incorporating task-relevant, target-language medical
keywords into the input, acting as both a domain signal and a language anchor. This strategy improves
fluency, coherence, and clinical relevance of the generated summaries, even in zero-shot scenarios
where fine-tuning is not feasible. Furthermore, combining MedCOD with fine-tuning yields the best
performance across most non-English tasks. For instance, in Spanish, Qwen2.5’s BERTScore F improves
from 0.7650 (no context) to 0.7903 with the addition of MedCOD context and fine-tuning. These results
confirm that MedCOD provides a practical and effective solution for overcoming the multilingual
limitations of current LLMs and supports the development of equitable, language-inclusive clinical NLP
systems.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research was supported by the U.S. Department of Veterans Affairs (VA) under award VA HSR IIR
24-083 (I01HX003969). The content is solely the responsibility of the authors and does not necessarily
represent the views of the VA or the U.S. government.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used AI in order to perform grammar and spelling
checks. No generative AI tools were used to produce images or other creative content. After using
this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the
publication’s content.</p>
      <p>[10] J. M. Ehrenfeld, J. P. Wanderer, Technology as friend or foe? Do electronic health records increase
burnout?, Current Opinion in Anesthesiology 31 (2018) 357–360.
[11] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,
C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot
learners, 2020. URL: https://arxiv.org/abs/2005.14165. arXiv:2005.14165.
[12] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du,
C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. Nie, J.-R. Wen, A survey
of large language models, 2025. URL: https://arxiv.org/abs/2303.18223. arXiv:2303.18223.
[13] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li,
S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, Y. Zhang, Sparks of artificial general intelligence:
Early experiments with gpt-4, 2023. URL: https://arxiv.org/abs/2303.12712. arXiv:2303.12712.
[14] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu,
A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Ré, D. Acosta-Navas,
D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. Wang,
K. Santhanam, L. Orr, L. Zheng, M. Yuksekgonul, M. Suzgun, N. Kim, N. Guha, N. Chatterji,
O. Khattab, P. Henderson, Q. Huang, R. Chi, S. M. Xie, S. Santurkar, S. Ganguli, T. Hashimoto,
T. Icard, T. Zhang, V. Chaudhary, W. Wang, X. Li, Y. Mai, Y. Zhang, Y. Koreeda, Holistic evaluation
of language models, 2023. URL: https://arxiv.org/abs/2211.09110. arXiv:2211.09110.
[15] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing,
H. Zhang, J. E. Gonzalez, I. Stoica, Judging llm-as-a-judge with mt-bench and chatbot arena, NIPS
’23, Curran Associates Inc., Red Hook, NY, USA, 2023.
[16] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D. Kalenichenko, Quantization
and training of neural networks for efficient integer-arithmetic-only inference, 2017. URL:
https://arxiv.org/abs/1712.05877. arXiv:1712.05877.
[17] H. Nori, Y. T. Lee, S. Zhang, D. Carignan, R. Edgar, N. Fusi, N. King, J. Larson, Y. Li, W. Liu,
R. Luo, S. M. McKinney, R. O. Ness, H. Poon, T. Qin, N. Usuyama, C. White, E. Horvitz, Can
generalist foundation models outcompete special-purpose tuning? case study in medicine, 2023.
URL: https://arxiv.org/abs/2311.16452. arXiv:2311.16452.
[18] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis,
S. Pfohl, et al., Large language models encode clinical knowledge, Nature 620 (2023) 172–180.
[19] Y. Liu, et al., Multilingual denoising pre-training for neural machine translation, arXiv preprint
arXiv:2001.08210 (2020). URL: https://arxiv.org/abs/2001.08210.
[20] H. Lu, H. Yang, H. Huang, D. Zhang, W. Lam, F. Wei, Chain-of-dictionary prompting elicits
translation in large language models, in: Proceedings of the 2024 Conference on Empirical
Methods in Natural Language Processing, 2024, pp. 958–976.
[21] W. Liu, S. Cai, Translating electronic health record notes from english to spanish: A preliminary
study, in: BioNLP 2015, 2015, pp. 134–140.
[22] H. Chintakunta, M. Zhang, S. Shaar, et al., Multilingual biomedical translation benchmarks, arXiv
preprint arXiv:2301.02500 (2023). URL: https://arxiv.org/abs/2301.02500.
[23] S. Khare, B. Gholami, et al., Melot: A medical language translation dataset, arXiv preprint
arXiv:2307.07955 (2023). URL: https://arxiv.org/abs/2307.07955.
[24] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank
adaptation of large language models, 2021. arXiv:2106.09685.
[25] M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M.
Javaheripi, P. Kaufmann, J. R. Lee, Y. T. Lee, Y. Li, W. Liu, C. C. T. Mendes, A. Nguyen, E. Price,
G. de Rosa, O. Saarikivi, A. Salim, S. Shah, X. Wang, R. Ward, Y. Wu, D. Yu, C. Zhang, Y. Zhang,
Phi-4 technical report, 2024. arXiv:2412.08905.
[26] OpenAI, GPT-4o mini: Advancing cost-efficient intelligence, 2024. URL:
https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/.
[27] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization
Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL:
https://aclanthology.org/W04-1013/.
[28] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with
bert, 2020. URL: https://arxiv.org/abs/1904.09675. arXiv:1904.09675.
[29] T. Le Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon,
M. Gallé, et al., Bloom: A 176b-parameter open-access multilingual language model (2023).
[30] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin,
et al., Opt: Open pre-trained transformer language models, arXiv preprint arXiv:2205.01068 (2022).
[31] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic
survey of prompting methods in natural language processing, ACM computing surveys 55 (2023)
1–35.
[32] F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder,
D. Zhou, et al., Language models are multilingual chain-of-thought reasoners, arXiv preprint
arXiv:2210.03057 (2022).
[33] W. S. Jang, S. Sultana, Z. Yao, H. Tran, Z. Yang, S. Kwon, H. Yu, Enhancing llms for identifying
and prioritizing important medical jargons from electronic health record notes utilizing data
augmentation, arXiv preprint arXiv:2502.16022 (2025).
[34] J. Chen, E. Druhl, B. Polepalli Ramesh, T. K. Houston, C. A. Brandt, D. M. Zulman, V. G. Vimalananda,
S. Malkani, H. Yu, A natural language processing system that links medical terms in electronic
health record notes to lay definitions: system development using physician reviews, Journal of
medical Internet research 20 (2018) e26.
[35] J. Chen, J. Zheng, H. Yu, et al., Finding important terms for patients in their electronic health
records: a learning-to-rank approach using expert annotations, JMIR medical informatics 4 (2016)
e6373.
[36] C. Snell, J. Lee, K. Xu, A. Kumar, Scaling llm test-time compute optimally can be more effective
than scaling model parameters, arXiv preprint arXiv:2408.03314 (2024).
[37] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-consistency
improves chain of thought reasoning in language models, arXiv preprint arXiv:2203.11171 (2022).
[38] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye,
Y. Yang, et al., Self-refine: Iterative refinement with self-feedback, Advances in Neural Information
Processing Systems 36 (2023) 46534–46594.
[39] J. Hu, Z. Zhang, G. Chen, X. Wen, C. Shuai, W. Luo, B. Xiao, Y. Li, M. Tan, Test-time learning for
large language models, arXiv preprint arXiv:2505.20633 (2025).
[40] H. Tran, Z. Yao, L. Li, H. Yu, Readctrl: Personalizing text generation with readability-controlled
instruction learning, arXiv preprint arXiv:2406.09205 (2024).
[41] S. Kwon, Z. Yao, H. S. Jordan, D. A. Levy, B. Corner, H. Yu, Medjex: A medical jargon extraction
model with wiki’s hyperlink span and contextualized masked language model score, in:
Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on
Empirical Methods in Natural Language Processing, volume 2022, 2022, p. 11733.
[42] Z. Yao, N. S. Kantu, G. Wei, H. Tran, Z. Duan, S. Kwon, Z. Yang, H. Yu, et al., Readme: Bridging
medical jargon and lay understanding for patient education through data-centric nlp, arXiv
preprint arXiv:2312.15561 (2023).
[43] P. Cai, Z. Yao, F. Liu, D. Wang, M. Reilly, H. Zhou, L. Li, Y. Cao, A. Kapoor, A. Bajracharya, et al.,
Paniniqa: Enhancing patient education through interactive question answering, Transactions of
the Association for Computational Linguistics 11 (2023) 1518–1536.
[44] Z. Yao, H. Yu, A survey on llm-based multi-agent ai hospital (2025).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <article-title>Translating electronic health record notes from english to spanish: A preliminary study</article-title>
          ,
          <source>in: Proceedings of BioNLP 15</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>134</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Van Veen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Van Uden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Blankemeier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Delbrouck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bluethgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pareek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polacin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Reis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Seehofnerová</surname>
          </string-name>
          , et al.,
          <article-title>Adapted large language models can outperform medical experts in clinical text summarization</article-title>
          ,
          <source>Nature Medicine</source>
          <volume>30</volume>
          (
          <year>2024</year>
          )
          <fpage>1134</fpage>
          -
          <lpage>1142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodríguez-Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lima-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Escolano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pratesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vigil-Giménez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Farré-Maduell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <article-title>Overview of multiclinsum task at bioasq 2025: Evaluation of clinical case summarization strategies for multiple languages: Data, evaluation, resources and results</article-title>
          ,
          <source>in: CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          ,
          <year>2025</year>
          . To appear.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Chain-of-dictionary prompting elicits translation in large language models</article-title>
          ,
          <source>in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>958</fpage>
          -
          <lpage>976</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Salim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <article-title>Enhancing english-to-spanish medical translation of large language models using enriched chain-of-dictionary framework</article-title>
          ,
          <year>2025</year>
          . Under review.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <article-title>Qwen2.5: A party of foundation models</article-title>
          ,
          <year>2024</year>
          . URL: https://qwenlm.github.io/blog/qwen2.5/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Costa-jussà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Çelebi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elbayad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Heafield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Heffernan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kalbassi</surname>
          </string-name>
          , et al.,
          <article-title>No language left behind: Scaling human-centered machine translation</article-title>
          ,
          <source>arXiv preprint arXiv:2207.04672</source>
          (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2207.04672.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Gesner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gazarian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dykes</surname>
          </string-name>
          ,
          <article-title>The burden and burnout in documenting patient care: an integrative literature review</article-title>
          ,
          <source>MEDINFO 2019: Health and Wellbeing e-Networks for All</source>
          (
          <year>2019</year>
          )
          <fpage>1194</fpage>
          -
          <lpage>1198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Ratwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Savage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Will</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Arnold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Khairat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Fairbanks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hodgkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Hettinger</surname>
          </string-name>
          ,
          <article-title>A usability and safety analysis of electronic health records: a multi-center study</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          <volume>25</volume>
          (
          <year>2018</year>
          )
          <fpage>1197</fpage>
          -
          <lpage>1201</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>