<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AI Stat Lab: A Modular Framework for Clinically Accurate Medical Image Captioning Using Vision-Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yunseo Lee</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hyun Jun Kim</string-name>
          <email>hyunjun0615@cau.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heeseung Shin</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Changwon Lim</string-name>
          <email>clim@cau.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Applied Statistics, Chung-Ang University</institution>
          ,
          <addr-line>84 Heukseok-ro, Dongjak-gu, Seoul 06974</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Smart Cities, Chung-Ang University</institution>
          ,
          <addr-line>84 Heukseok-ro, Dongjak-gu, Seoul 06974</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Statistics and Data Science, Chung-Ang University</institution>
          ,
          <addr-line>84 Heukseok-ro, Dongjak-gu, Seoul 06974</addr-line>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>We propose a modular framework for medical image captioning that integrates domain-adapted visual encoders, token-efficient representation via query-based compression, and post-hoc refinement. The architecture employs an ensemble of general-purpose and domain-specific vision encoders (SigLIP2 and BioMedCLIP), a Q-Former for dense concept-aware tokenization, and a LoRA-tuned Bio-Medical LLaMA-3 decoder. Auxiliary objectives guide the model to jointly predict UMLS concepts and semantic types, improving semantic grounding. At inference, captions from six independently trained variants are reranked using three complementary strategies: BioMedCLIP similarity, BLEURT scoring, and BioBERT-based centroid alignment. Evaluations on the ImageCLEF2025 Caption Prediction Task demonstrate consistent gains in semantic relevance and clinical factuality over single-encoder and non-multitask baselines. Our approach (team: AI Stat Lab, ID #1900) achieved third place with an overall score of 0.3229, corresponding to relevance and factuality scores of 0.5089 and 0.1369, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical image captioning</kwd>
        <kwd>Vision-language model</kwd>
        <kwd>Dual Encoder</kwd>
        <kwd>UMLS concepts</kwd>
        <kwd>Caption reranking</kwd>
        <kwd>GPT summarization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Medical image captioning, automatically generating radiologist-style descriptions from imaging studies,
has the potential to accelerate report drafting, improve content-based image retrieval, and increase
the interpretability of diagnostic AI models. Compared with natural-image captioning, the task is
complicated by grayscale modalities, subtle anatomical cues, and a highly specialized vocabulary, all of
which demand fine-grained visual reasoning and domain knowledge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        While prior efforts have made notable progress by employing encoder-decoder frameworks trained
on paired image-text datasets, the performance of these systems is often hindered by limitations in
data quality, domain adaptability, and output reliability [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ]. For instance, low-resolution images
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and annotation-induced artifacts are prevalent in public medical datasets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], degrading model
perception. Moreover, generic vision encoders may lack the capacity to extract subtle domain-specific
features [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and caption decoders often produce inconsistent or incomplete descriptions due to limited
grounding in clinical semantics [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. To address these limitations, we construct a modular medical
captioning framework by assembling and adapting proven techniques across the visual and language
modeling pipeline. In particular, the pre-processing stage includes resolution enhancement and visual
consistency adjustments [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. Specifically, we integrate:
1. A dual-encoder configuration using SigLIP2 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and BioMedCLIP [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] for both general and
domain-specific feature extraction [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
2. A Query Transformer (Q-Former) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to reduce redundancy and enable concept-aware
representations,
3. A biomedical LLaMA-3 decoder [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] fine-tuned via Low-Rank Adaptation (LoRA) for efficient
adaptation, and
4. A post-hoc refinement stage that consolidates outputs from six independently trained captioning
models.
      </p>
      <p>
        This module employs GPT-4-based summarization [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and multiple reranking strategies—including
BioMedCLIP similarity, BLEURT [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] scoring, and centroid-based selection [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] using BioBERT [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]—to
generate a single, clinically coherent caption.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Medical image captioning has evolved alongside advances in vision-language modeling, primarily
following the encoder-decoder paradigm widely used in natural image captioning. Early works employed
convolutional neural networks (CNNs) as visual encoders paired with recurrent neural networks (RNNs)
or Transformer-based decoders to generate captions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, these approaches often lacked
clinical specificity, as they relied on general-purpose image features and were trained on limited or
noisy medical datasets.
      </p>
      <p>
        More recently, the integration of large-scale vision-language models (VLMs), such as BioMedCLIP [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
has enabled more transferable and semantically rich representations across diverse medical imaging
modalities [
        <xref ref-type="bibr" rid="ref19 ref20 ref21 ref22">19, 20, 21, 22</xref>
        ]. These models, pretrained on multimodal datasets, facilitate improved
generalization to unseen clinical data with minimal supervision.
      </p>
      <p>
        Furthermore, the advent of large language models (LLMs), including GPT [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and LLaMA [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ],
has further advanced captioning performance by providing enhanced language fluency, contextual
reasoning, and factual alignment. Some recent systems incorporate LLMs as decoders conditioned on
image-derived embeddings or prompts, allowing for richer and more coherent textual outputs.
      </p>
      <p>
        In parallel, post-hoc refinement strategies have emerged as a practical solution for improving caption
consistency. Ensemble-based generation followed by reranking using clinical relevance metrics—such
as BERTScore [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], BLEURT [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], and visual-semantic similarity—has shown promise in reducing
redundancy and hallucination. GPT-based summarization has also been explored to consolidate conflicting
candidate captions into a single coherent report.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Pre-processing</title>
        <p>Training images in ROCOv2 exhibit two systematic defects, low spatial resolution and bright border
artifacts, that degrade visual embeddings and, by extension, caption quality. We therefore apply a
two-stage pre-processing pipeline comprising super-resolution and structure-aware inpainting.</p>
        <p>
          First, we observed that 3,485 training images exhibited spatial resolutions smaller than 300 × 300
pixels. Considering the non-negligible proportion of such images and the risk of losing fine-grained
visual cues crucial for captioning, we applied 2× super-resolution to these samples. For this purpose,
we utilized the Feedback Adaptive Weighted Dense Network (FAWDN) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], a recurrent convolutional
architecture equipped with a feedback mechanism and adaptive dense blocks. FAWDN progressively
refines image quality over multiple time steps by combining current inputs with hidden states from
previous iterations. The network is composed of shared input, hidden, and output units across all time
steps, and integrates an Adaptive Weighted Dense Block (AWDB) that captures multi-scale features
through a combination of 1×1 convolutional layers and dense connections. This network was selected
not only for its proven performance on diverse image datasets but also due to the availability of
pretrained models specific to the medical domain, allowing us to avoid resource-intensive training of
super-resolution models from scratch.
        </p>
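        <p>As an illustration of this screening step, the following Python sketch (not part of the released pipeline) upsamples every image whose shorter side falls below 300 pixels; the pretrained FAWDN model is abstracted behind a hypothetical fawdn_x2 callable.</p>
        <preformat>
from pathlib import Path
from PIL import Image

MIN_SIZE = 300  # images below 300x300 pixels are upsampled before encoding

def upscale_small_images(image_dir, fawdn_x2):
    """Apply 2x super-resolution to every image below the size threshold."""
    n_upscaled = 0
    for path in sorted(Path(image_dir).glob("*.jpg")):
        img = Image.open(path).convert("RGB")
        if min(img.size) &lt; MIN_SIZE:
            sr_img = fawdn_x2(img)   # hypothetical callable wrapping the pretrained FAWDN model
            sr_img.save(path)        # overwrite in place so the encoders see the enhanced image
            n_upscaled += 1
    return n_upscaled
        </preformat>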
        <p>
          Second, to address the frequent presence of white or overly bright borders in the dataset images—often
resulting from scanning artifacts or annotation overlays—we implemented a structure-aware inpainting
strategy instead of simple cropping. Specifically, we identified border regions with brightness levels
exceeding 245 within a fixed 8% margin around the image edges and applied the inpainting algorithm
introduced by Telea [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to fill these regions using nearby pixel information. Representative results
are shown in Figure 1. Unlike hard cropping, which risks discarding medically relevant content near
the periphery, this inpainting method preserves the overall anatomical integrity of each image while
eliminating non-informative border artifacts. This procedure enhances the visual consistency of inputs
and prevents the model from learning spurious cues unrelated to the actual medical content.
        </p>
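        <p>The border clean-up can be sketched as follows with OpenCV, using the 8% margin and 245 brightness threshold stated above; the inpainting radius of 3 pixels is an illustrative assumption.</p>
        <preformat>
import cv2
import numpy as np

def inpaint_bright_border(img_bgr, margin_ratio=0.08, brightness_thr=245):
    """Inpaint overly bright pixels found inside the fixed border margin (Telea's method)."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    mh, mw = max(1, int(h * margin_ratio)), max(1, int(w * margin_ratio))

    # Restrict the candidate region to the outer 8% frame of the image.
    border = np.zeros((h, w), dtype=np.uint8)
    border[:mh, :] = border[-mh:, :] = 1
    border[:, :mw] = border[:, -mw:] = 1

    # Mark bright border pixels (grayscale value above the 245 threshold) for inpainting.
    mask = ((gray &gt; brightness_thr) &amp; (border == 1)).astype(np.uint8) * 255
    if mask.sum() == 0:
        return img_bgr  # nothing to clean up
    return cv2.inpaint(img_bgr, mask, 3, cv2.INPAINT_TELEA)
        </preformat>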
        <p>Together, these pre-processing steps improve the signal-to-noise ratio in the image encoder input
and help stabilize caption generation by standardizing input quality across the dataset.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model Architecture</title>
        <p>The overall architecture of our proposed medical image captioning model is illustrated in Figure 2. The
model consists of dual vision encoders, a Query Transformer (Q-Former), and a domain-adapted LLaMA
decoder, which are described in detail in the following subsections.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Dual Encoder</title>
          <p>
            To derive robust and semantically rich visual representations from medical images, we adopt an ensemble
of two vision encoders. Specifically, we utilize SigLIP2 [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], a general-purpose image encoder pretrained
on large-scale natural image-text pairs, and BioMedCLIP [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], a medical-domain-specific encoder
trained on 15 million image-caption pairs mined from PubMed Central.
          </p>
          <p>
            To address the lack of medical image knowledge in the original SigLIP2 model, we perform
domain-specific pre-adaptation by fine-tuning it on the dataset provided by the ImageCLEF2025 Caption
Prediction Task [
            <xref ref-type="bibr" rid="ref28 ref29">28, 29</xref>
            ]. This enhances the encoder’s ability to capture domain-relevant visual features
while preserving generalization capacity. Following standard practice, we remove the classification heads
from both encoders and extract intermediate features from their penultimate transformer layers. Let the
feature outputs from BioMedCLIP and SigLIP2 be denoted as f_bioclip ∈ ℝ^768 and f_siglip ∈ ℝ^1536,
respectively. These representations are concatenated to form a unified embedding f = [f_bioclip; f_siglip] ∈
ℝ^2304, which preserves domain-specific detail from BioMedCLIP and high-level semantics from
SigLIP2.
          </p>
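          <p>The fusion itself is a simple channel-wise concatenation, sketched below in PyTorch under the assumption that both encoders already return pooled penultimate-layer features of the stated dimensions.</p>
          <preformat>
import torch

def fuse_visual_features(biomedclip_feat, siglip2_feat):
    """Concatenate domain-specific (768-d) and general-purpose (1536-d) embeddings."""
    assert biomedclip_feat.shape[-1] == 768 and siglip2_feat.shape[-1] == 1536
    return torch.cat([biomedclip_feat, siglip2_feat], dim=-1)  # unified 2304-d embedding

# Example with a batch of 4 images
f_bio = torch.randn(4, 768)     # BioMedCLIP penultimate-layer features
f_sig = torch.randn(4, 1536)    # SigLIP2 penultimate-layer features
f = fuse_visual_features(f_bio, f_sig)  # shape (4, 2304)
          </preformat>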
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Query Transformer (Q-Former)</title>
          <p>
            To reduce redundancy and computational burden, we apply a Q-Former [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] that projects the
high-dimensional visual embedding f ∈ ℝ^2304 into a fixed number of informative latent tokens. The
visual feature f is broadcast across a learnable set of query tokens, resulting in a sequence input
X ∈ ℝ^(32×2304). The Q-Former consists of six transformer layers with cross-attention modules that
allow each query token to selectively attend to parts of the visual input. The output of the Q-Former is
denoted as Z ∈ ℝ^(32×4096). This output is used for both caption generation and auxiliary concept
classification.
          </p>
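          <p>A highly simplified, illustrative Q-Former-style module is sketched below; the actual component follows the BLIP-2 Q-Former design, and the head count, initialization, and feed-forward width shown here are assumptions.</p>
          <preformat>
import torch
import torch.nn as nn

class MiniQFormer(nn.Module):
    """Sketch: 32 learnable queries, six cross-attention layers, 2304-d in, 4096-d out."""
    def __init__(self, vis_dim=2304, out_dim=4096, n_queries=32, n_layers=6, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, out_dim) * 0.02)
        self.vis_proj = nn.Linear(vis_dim, out_dim)   # project visual features to query width
        self.layers = nn.ModuleList()
        for _ in range(n_layers):
            self.layers.append(nn.ModuleDict({
                "cross_attn": nn.MultiheadAttention(out_dim, n_heads, batch_first=True),
                "ffn": nn.Sequential(nn.Linear(out_dim, 4 * out_dim), nn.GELU(),
                                     nn.Linear(4 * out_dim, out_dim)),
                "norm1": nn.LayerNorm(out_dim),
                "norm2": nn.LayerNorm(out_dim),
            }))

    def forward(self, vis_feat):                      # vis_feat: (B, 2304) fused encoder output
        kv = self.vis_proj(vis_feat).unsqueeze(1)     # (B, 1, 4096) visual key/value
        z = self.queries.unsqueeze(0).expand(vis_feat.size(0), -1, -1)  # (B, 32, 4096)
        for layer in self.layers:
            attn_out, _ = layer["cross_attn"](z, kv, kv)  # queries attend to the visual input
            z = layer["norm1"](z + attn_out)
            z = layer["norm2"](z + layer["ffn"](z))
        return z                                      # (B, 32, 4096) prefix tokens for the LLM
          </preformat>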
          <p>
            To enhance medical grounding, we incorporate a multitask classification objective [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ]. The output
Z ∈ ℝ^(32×4096) is mean-pooled across the query dimension to produce a global representation
z̄ ∈ ℝ^4096. This representation is passed through two linear classifiers: one to predict concept
presence among 2,478 Concept Unique Identifiers (CUIs), and another to predict 21 coarse concept
types. The overall loss function is a weighted combination of the captioning loss and classification loss:
ℒ_total = ℒ_caption + λ · ℒ_cls,
where ℒ_caption is the cross-entropy loss over caption tokens, and the auxiliary term ℒ_cls uses the
multilabel margin loss. This multi-task setup improves alignment between the generated captions and clinical
concepts visually present in the input image.
          </p>
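          <p>The loss combination can be sketched as follows; a multilabel soft-margin loss is used in the sketch as a compact stand-in for the multilabel margin loss, and the multi-hot target construction is assumed to be handled elsewhere.</p>
          <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CUIS, NUM_TYPES, LAMBDA = 2478, 21, 0.1

cui_head = nn.Linear(4096, NUM_CUIS)    # concept-presence classifier
type_head = nn.Linear(4096, NUM_TYPES)  # coarse semantic-type classifier

def multitask_loss(caption_logits, caption_labels, z, cui_targets, type_targets):
    """caption_logits: (B, T, V); z: (B, 32, 4096) Q-Former output; targets are multi-hot."""
    z_bar = z.mean(dim=1)  # mean-pool over the 32 query tokens
    caption_loss = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_labels.reshape(-1), ignore_index=-100)
    # Soft-margin stand-in for the multilabel margin loss used in the paper.
    cls_loss = (F.multilabel_soft_margin_loss(cui_head(z_bar), cui_targets)
                + F.multilabel_soft_margin_loss(type_head(z_bar), type_targets))
    return caption_loss + LAMBDA * cls_loss
          </preformat>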
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Caption Decoder</title>
          <p>
            For the caption generation task, we adopt Bio-Medical LLaMA-3-8B [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], a domain-specialized variant of
Meta-Llama-3-8B-Instruct [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ], as the language decoder. The model has been fine-tuned on BioMedData,
a high-quality biomedical dataset containing over 500,000 entries. The dataset comprises a blend
of synthetic and manually curated samples, enabling robust generalization across a wide range of
biomedical contexts. During training, the 32 Q-Former tokens are inserted as prefix embeddings that
condition every decoding step on visual evidence. To enable efficient fine-tuning, we incorporate LoRA
[
            <xref ref-type="bibr" rid="ref32">32</xref>
            ] modules into the decoder. This allows the model to adapt to medical image captioning tasks with
minimal parameter updates while preserving the core language modeling capabilities of LLaMA.
          </p>
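          <p>A hedged sketch of the LoRA insertion using the PEFT library is shown below; the rank, scaling factor, and target modules are illustrative rather than the exact values used in our experiments.</p>
          <preformat>
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Domain-specialized decoder referenced in the text (ContactDoctor Bio-Medical LLaMA-3-8B).
decoder = AutoModelForCausalLM.from_pretrained("ContactDoctor/Bio-Medical-Llama-3-8B")

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,                    # illustrative hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # attention projections
    task_type="CAUSAL_LM",
)
decoder = get_peft_model(decoder, lora_cfg)
decoder.print_trainable_parameters()  # only the low-rank adapters are updated
          </preformat>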
        </sec>
        <sec id="sec-3-2-4">
          <title>3.2.4. Model Variants for Ensemble</title>
          <p>
            To improve caption diversity and stabilize final output quality, we trained six independently
parameterized captioning models under varying training configurations. All models share the same core
architecture consisting of a Q-Former module and a LLaMA-based language decoder, but differ in their
visual encoder types and auxiliary training settings. Specifically, we constructed two models each for
three encoder configurations: (1) using BioMedCLIP alone, (2) using SigLIP2 alone, and (3) using a
dual-encoder setup that concatenates both BioMedCLIP and SigLIP2. Within each encoder group, one
model was trained with auxiliary concept classification (predicting UMLS concepts and types [
            <xref ref-type="bibr" rid="ref33">33</xref>
            ])
and one without it. These six models generate diverse caption candidates for each image, forming the
foundation for our post-processing pipeline described in the next subsection.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Post-processing</title>
        <p>To further refine the raw captions generated by our six independently trained captioning models, we
applied post-processing strategies aimed at improving both clinical coherence and factual relevance.
This section presents two major post-processing components: (1) summarization-based refinement
using GPT APIs and (2) candidate caption reranking based on semantic and domain-specific metrics.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Summarization-based Refinement</title>
          <p>
            We employed two GPT-4-based summarization [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] strategies to consolidate the six candidate
captions—each produced by a different model—into a single, medically accurate sentence. Both approaches
aimed to improve readability, reduce redundancy, and ensure consistency with structured medical
knowledge. The exact prompts used for each summarization method are provided in Table 1 below.
          </p>
          <p>
            The Chain-of-Thought prompt reads: "You are a board-certified radiologist. TASK: 1. Parse EACH caption and list by line: &lt;MODALITY&gt;, &lt;ANATOMIC_SITE&gt;, &lt;PATHOLOGIES&gt;, etc. 2. Build a CONSENSUS table of token frequency. 3. Resolve conflicts by majority vote or keep the longer/more specific one. 4. Compose ONE radiology-style sentence (35–45 words) that retains exact terms from the table; concatenates modality → site → key findings → clinical context; uses 'shows' or 'demonstrates' and avoids headings; omits absent content. OUTPUT: FINAL_CAPTION: &lt;your summary&gt; CAPTIONS: caption 1: {caption1} … caption 6: {caption6}"
          </p>
          <p>
            The Prompt-guided prompt reads: "You are a radiologist summarizing multiple captions of a medical image into ONE detailed sentence. Integrate the imaging modality, anatomical location, pathological findings, and specific clinical details. Use medically correct, extractive phrasing that maximizes token overlap; avoid paraphrasing unless synonymous medical terminology improves clarity. Use present continuous tense with a subject-predicate-object structure. Keep the summary natural, clinically accurate, and around 40 words, allowing slight variation if shorter or longer improves clarity. If captions contain inconsistencies, prioritize findings with the highest diagnostic or therapeutic relevance. HERE are the captions: caption 1: {caption1} … caption 6: {caption6}"
          </p>
          <p>
            Prompt-guided Summarization The six caption candidates, one from each captioning model, were
aggregated and fed into a standardized GPT-4 prompt. The prompt requested a concise and clinically
coherent summary under the assumption that these captions describe the same medical image. This
helped filter out redundant or inconsistent information and unify expression styles across captions.
Chain-of-Thought Summarization In this variant, the prompt instructed the model to generate
step-by-step reasoning [
            <xref ref-type="bibr" rid="ref34">34</xref>
            ] before concluding the final summary. The intent was to increase factual
consistency by encouraging the model to align each summary point with the underlying clinical evidence
extracted from input captions. Empirically, this strategy improved alignment with structured medical
entities.
          </p>
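          <p>A minimal sketch of such a summarization call (OpenAI Python SDK v1 style) is shown below; the model identifier and sampling temperature are assumptions, and the full prompts of Table 1 would be passed as the system message.</p>
          <preformat>
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_captions(captions, system_prompt):
    """Consolidate six candidate captions into one caption via a GPT-4-class model."""
    caption_block = "\n".join(f"caption {i + 1}: {c}" for i, c in enumerate(captions))
    response = client.chat.completions.create(
        model="gpt-4o",        # assumption: any GPT-4-class chat model
        temperature=0.2,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "HERE are the captions:\n" + caption_block},
        ],
    )
    return response.choices[0].message.content.strip()
          </preformat>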
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Caption Reranking</title>
          <p>
            To select the most appropriate caption among the generated candidates, we implemented a reranking
module based on three different metrics: BioMedCLIP image-text alignment, BLEURT self-consensus, and
BioBERT centroid proximity. The overall framework of these reranking strategies is illustrated in
Figure 3.
          </p>
          <p>
            BioMedCLIP image-text alignment Each caption c_i is embedded into a text vector t_i with the BioMedCLIP
[
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] text encoder, and the corresponding image I is embedded into w using the image encoder. Each candidate
is scored by the cosine similarity cos(t_i, w), and the caption whose text embedding is most similar to the
image embedding is selected.
          </p>
          <p>
            BLEURT self-consensus BLEURT [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] estimates sentence quality via a regression head over
BERT-style embeddings. For caption c_i among n candidates, we compute the leave-one-out average
score(c_i) = (1/(n − 1)) Σ_{j≠i} BLEURT(c_i, c_j),
which rewards captions that are semantically central to the hypothesis set and thus robust to outliers.
The final caption is selected by maximizing this self-consistency score: ĉ = arg max_i score(c_i).
          </p>
          <p>
            BioBERT centroid proximity All captions are embedded via BioBERT [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] as vectors v_1, . . . , v_n. The centroid
v̄ = (1/n) Σ_i v_i
represents the consensus semantic position. Each caption is then ranked by its Euclidean distance to the
centroid [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ], and the final caption is ĉ = arg min_i ‖v_i − v̄‖.
          </p>
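          <p>The BLEURT self-consensus and centroid-proximity selections reduce to a few lines of NumPy once the pairwise BLEURT scores and BioBERT sentence embeddings have been computed, as the following sketch illustrates.</p>
          <preformat>
import numpy as np

def bleurt_self_consensus(pairwise_scores):
    """pairwise_scores[i, j] = BLEURT(candidate i as prediction, candidate j as reference)."""
    n = pairwise_scores.shape[0]
    mask = ~np.eye(n, dtype=bool)
    loo_avg = (pairwise_scores * mask).sum(axis=1) / (n - 1)  # leave-one-out average per caption
    return int(np.argmax(loo_avg))                            # most self-consistent caption

def centroid_proximity(embeddings):
    """embeddings: (n, d) BioBERT sentence vectors for the n candidate captions."""
    centroid = embeddings.mean(axis=0)                        # consensus semantic position
    dists = np.linalg.norm(embeddings - centroid, axis=1)     # Euclidean distance to the centroid
    return int(np.argmin(dists))                              # caption closest to the consensus
          </preformat>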
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setups</title>
        <p>
          Dataset We conduct experiments on the extended version of the ROCOv2 dataset[
          <xref ref-type="bibr" rid="ref28 ref35">28, 35</xref>
          ], specifically
curated for the ImageCLEFmedical 2025 Caption Prediction Task [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. Unlike the original ROCOv2 [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ],
this updated release includes additional manual annotations as well as a newly introduced test set for
the 2025 challenge. The dataset configuration differs from prior versions: the previous test set from
ROCOv2 has been reassigned as the validation set, and the prior validation set has been merged into
the training set. The newly collected 2025 test set contains unseen images to evaluate generalization
performance under updated task conditions. The resulting splits comprise 80,091 images for training,
17,277 for validation, and 19,267 for testing. Each image is associated with a manually curated caption
and UMLS concepts, making it suitable for both generation and concept detection tasks.
Evaluation Metrics Model performance is evaluated according to the official challenge protocol
using six metrics that assess both relevance and factuality. Relevance is assessed using BERTScore (Recall
with IDF), ROUGE-1 (F1), BLEURT, and Image-text Similarity; BERTScore is computed with the
microsoft/deberta-xlarge-mnli model using IDF scores derived from the test set, and BLEURT uses the
recommended BLEURT-20 checkpoint. Image-text Similarity is evaluated by independently extracting
embedding vectors for the image and its corresponding caption using the MedImageInsight [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] model,
followed by computing their cosine similarity. All relevance metrics are calculated on lowercase,
punctuation-free captions with numbers replaced by the token “number.” For factuality, UMLS Concept
F1 is computed using MedCAT and semantic type filtering via QuickUMLS, and AlignScore is used to
measure information consistency between predicted and reference captions based on RoBERTa-base
alignment. All scores are averaged over the entire test corpus.
        </p>
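        <p>For reference, the caption normalization applied before the relevance metrics (lowercasing, punctuation removal, and number masking) can be sketched as follows; the organizers' exact implementation may differ in detail.</p>
        <preformat>
import re
import string

def normalize_caption(text):
    """Lowercase, replace numbers with the token 'number', and strip punctuation."""
    text = text.lower()
    text = re.sub(r"\d+(\.\d+)?", "number", text)  # digits (incl. decimals) become "number"
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(normalize_caption("CT scan shows a 3.5 cm lesion in the left lobe."))
# prints: ct scan shows a number cm lesion in the left lobe
        </preformat>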
        <p>Model Settings Our system utilizes either BioMedCLIP or SigLIP2 as standalone vision encoders, or
their ensemble via channel-wise feature concatenation. The language decoder is Bio-Medical
LLaMA-3-8B, a domain-specific large language model. Visual features are processed by a 6-layer Q-Former with 32
learnable query tokens. The Q-Former maps from a 2304-dim input (concatenated encoder outputs) to
4096-dim embeddings compatible with the LLM. The model optionally includes auxiliary classification
heads that predict 2,478 UMLS concept labels and 21 coarse types. The total loss is computed as a
weighted sum of the captioning loss and the concept classification loss, where the weighting factor λ is
empirically set to 0.1.</p>
        <p>
          The model was trained using the AdamW optimizer, with the learning rate linearly increased to 1e-4
during the first epoch and annealed to 1e-6 over a total of 10 epochs. Training was conducted on an
NVIDIA H100 GPU with a batch size of 16 and a gradient accumulation step of 2. During inference, we
employed beam search decoding with a beam width of 3, a repetition penalty of 2.5, a length penalty of
2.0, and a minimum and maximum output length of 8 and 64 tokens, respectively.
Image and Text Pre-processing To mitigate quality degradation caused by low-resolution inputs
(&lt;300×300) and overly bright borders, we implemented a two-stage pre-processing pipeline comprising
FAWDN-based 2× super-resolution and structure-aware inpainting described in Section 3.1. In addition, we
experimented with applying a Gaussian filter during image pre-processing and GPT-based back-translation
[
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] (English → Korean → English) for text augmentation; however, neither approach yielded notable
performance improvements.
        </p>
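        <p>The optimizer schedule and decoding settings described above can be summarized in the following sketch; steps_per_epoch is a placeholder, the linear schedule decays toward zero rather than exactly 1e-6, and the minimum and maximum lengths are assumed to refer to newly generated tokens.</p>
        <preformat>
import torch
from transformers import GenerationConfig, get_linear_schedule_with_warmup

# Beam-search decoding settings used at inference.
gen_cfg = GenerationConfig(num_beams=3, repetition_penalty=2.5, length_penalty=2.0,
                           min_new_tokens=8, max_new_tokens=64)

# AdamW with linear warm-up over the first epoch, then linear decay over 10 epochs.
# A dummy parameter is used here so the snippet runs standalone.
steps_per_epoch, epochs = 1000, 10       # placeholder step count
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.AdamW([param], lr=1e-4)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=steps_per_epoch,
                                            num_training_steps=epochs * steps_per_epoch)
        </preformat>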
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Base Model Results</title>
        <p>We conducted ablation experiments to evaluate the impact of the dual encoder architecture and auxiliary
classification tasks on medical image captioning performance. The detailed performance comparison
across model variants is presented in Table 2. Compared to single-encoder baselines using either
BioMedCLIP (#1405) or SigLIP2 (#1407), the dual encoder model (#1673), which concatenates both
encoders along the channel dimension, consistently outperformed in terms of both relevance and factual
accuracy. On the test set, this model improved ROUGE-1 and BLEURT scores by up to +0.0107 and
+0.0081, respectively, while UMLS F1 increased by as much as +0.0119, demonstrating the effectiveness
of combining domain-specific and general-purpose visual representations.</p>
        <p>Building upon this, we introduced auxiliary classification heads for predicting UMLS concepts and
semantic types. Compared to the base dual encoder (#1673), the model with concept prediction (#1695)
achieved further gains across all major metrics, including an additional +0.0062 in ROUGE-1 and
+0.0048 in UMLS F1. These improvements underscore the value of explicitly modeling medical concepts,
which enhances the factual grounding of generated captions without sacrificing fluency. Among all
configurations, the dual encoder with auxiliary classification (#1695) achieved the strongest overall
performance, ranking first in four out of six evaluation metrics. These findings validate our architectural
choices: integrating heterogeneous visual features through a dual encoder and reinforcing clinical
relevance through concept-aware auxiliary tasks. Together, these components contribute to generating
more accurate, informative, and clinically coherent medical image captions.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Post-Processing Results</title>
        <p>We evaluate three caption reranking strategies, namely BioMedCLIP-base, BLEURT-base, and
BioBERT-base, by selecting the best caption among candidates generated from six base models. All three reranking
methods improve upon the base model outputs, confirming that post-processing plays a critical role in
enhancing caption quality. Among the methods, BLEURT-based reranking (#1965) achieved the highest
cosine similarity (0.9008) and BLEURT score (0.3186), while maintaining strong results across ROUGE
(0.2397) and UMLS F1 (0.1486). This suggests that selecting internally consistent captions—those that
align with the majority of candidate hypotheses—enhances both fluency and factual alignment.</p>
        <p>BioMedCLIP-based reranking (#1900) prioritized visual-semantic grounding, yielding the highest
ROUGE (0.2440) and competitive scores in ALIGN (0.1231) and UMLS F1 (0.1524). In contrast,
BioBERT-based reranking (#1944) produced the highest BERTScore (0.5854), along with balanced performance
across all metrics and the strongest UMLS F1 (0.1536). These results demonstrate that reranking not only
improves overall caption quality but also enables fine-grained control depending on whether fluency,
alignment, or clinical factuality is prioritized.</p>
        <p>In addition to reranking methods, GPT-4-guided summarization was assessed as an alternative
post-processing strategy. Despite the intuitive appeal of aggregating multiple candidate captions
into a single concise output, these summarization strategies underperformed relative to reranking in
our quantitative assessments. Specifically, both the prompt-guided and chain-of-thought prompting
approaches frequently exhibited reduced precision and occasionally introduced clinically irrelevant
or hallucinated content. These shortcomings were particularly evident in factual grounding metrics
such as ALIGN and UMLS F1, indicating that generative summarization may abstract away or omit key
clinical entities during compression. A detailed comparison of these limitations is provided in Table 4.</p>
        <p>In summary, each reranking strategy exhibits distinct advantages: BLEURT-base excels in fluency
and self-consistency, BioMedCLIP-base in vision-language alignment, and BioBERT-base in semantic
grounding and metric balance. These results highlight that the choice of reranking method should
depend on the specific priorities of the medical captioning application. Given that both relevance and
factuality were key evaluation metrics in the ImageCLEF2025 challenge, the post-processing approach
based on image-text alignment using BioMedCLIP achieved the highest performance. This submission,
made under the team name AI Stat Lab with submission ID #1900, achieved an overall score of 0.3229
and ranked third on the official leaderboard.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We introduced a modular medical-image captioning pipeline that unifies three components: (i) dual
vision encoders (BioMedCLIP + SigLIP2) to fuse domain-specific and general visual knowledge, (ii)
a multitask loss that aligns captions with 2,478 UMLS concepts and 21 semantic types, and (iii) a
metric-aware reranker that selects the most faithful hypothesis among six candidates. Specifically, our
best submission (#1900) was constructed by applying BioMedCLIP-based reranking to the pool of six
candidates generated from six base models (submissions #1405, #1407, #1673, #1693, #1694, and #1695,
in Table 2).</p>
      <p>On the ROCOv2 benchmark our system surpasses single-encoder and concept-agnostic baselines
on every shared-task metric, BERTScore, ROUGE-1, BLEURT, ALIGN, and UMLS-F1, demonstrating
simultaneous gains in linguistic relevance and clinical factuality. Among the post-processing strategies,
BioBERT-centric reranking achieves the best harmonic mean of relevance and factuality, whereas
BioMedCLIP-based scoring offers the highest image-text alignment, highlighting a trade-off that can be
tuned to downstream needs.</p>
      <p>These results validate two key insights: (1) heterogeneous visual encoders supply complementary
features that improve descriptive richness, and (2) explicit concept supervision curbs hallucinations
and improves diagnostic grounding. The proposed pipeline establishes a strong baseline for upcoming
CLEF-Cap tasks and paves the way for future work on longitudinal captioning, device detection, and
lightweight on-device deployment in clinical settings.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the National Research Foundation of Korea (NRF) grant funded by the
Korea government (MSIT) (RS-2024-00360176).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used OpenAI GPT-4o in order to: refine writing
style, reorganize paragraph structure, and assist with technical language formulation. After using this
tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the
publication’s content.</p>
    </sec>
    <sec id="sec-8">
      <title>Appendix A. Summarization-based Refinement Results</title>
      <p>While summarization-based refinement using GPT models provides a promising approach for
aggregating multiple candidate captions into a single concise output, as summarized in Table 4, our experiments
reveal that this strategy underperforms compared to reranking-based methods in terms of factual
alignment and clinical adequacy.</p>
      <p>The best summarization method — CoT-based refinement (#1938) — achieves a BERTScore of 0.5705,
BLEURT of 0.3197, ALIGN of 0.0843, and UMLS F1 of 0.1236. In comparison, the BioMedCLIP-based
reranking method (Table 3, #1900) outperforms it across all key metrics, including a higher BERTScore.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Beddiar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oussalah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Seppänen</surname>
          </string-name>
          ,
          <article-title>Automatic captioning for medical imaging (mic): a rapid review of literature</article-title>
          ,
          <source>Artificial intelligence review 56</source>
          (
          <year>2023</year>
          )
          <fpage>4019</fpage>
          -
          <lpage>4076</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Alzubaidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <article-title>A comparison review of transfer learning and self-supervised learning: Definitions, applications, advantages and limitations</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>242</volume>
          (
          <year>2024</year>
          )
          <fpage>122807</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Limbu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          , Medblip:
          <article-title>Fine-tuning blip for medical image captioning</article-title>
          ,
          <source>arXiv preprint arXiv:2505.14726</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Guan</surname>
          </string-name>
          , M. Liu,
          <article-title>Domain adaptation for medical image analysis: a survey</article-title>
          ,
          <source>IEEE Transactions on Biomedical Engineering</source>
          <volume>69</volume>
          (
          <year>2021</year>
          )
          <fpage>1173</fpage>
          -
          <lpage>1185</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Umirzakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. U.</given-names>
            <surname>Khan</surname>
          </string-name>
          , T. Whangbo,
          <article-title>Medical image super-resolution for smart healthcare applications: A comprehensive survey</article-title>
          ,
          <source>Information Fusion</source>
          <volume>103</volume>
          (
          <year>2024</year>
          )
          <fpage>102075</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pérez-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bond-Taylor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bouzid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Salvatelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ilse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bannur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwaighofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Lungren</surname>
          </string-name>
          , et al.,
          <article-title>Exploring scalable medical image encoders beyond text supervision</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          (
          <year>2025</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Dillman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Radclip: Enhancing radiologic image analysis through contrastive language-image pretraining</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kaliosis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Charalampakos</surname>
          </string-name>
          , G. Moschovis,
          <string-name>
            <surname>I. Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>A data-driven guided decoding mechanism for diagnostic captioning</article-title>
          , in: L.
          <string-name>
            <surname>-W. Ku</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Martins</surname>
          </string-name>
          , V. Srikumar (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL</source>
          <year>2024</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>7450</fpage>
          -
          <lpage>7466</lpage>
          . URL: https://aclanthology.org/2024.findings-acl.444/. doi:10.18653/v1/2024.findings-acl.444.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jeon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Anisetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>A trusted medical image super-resolution method based on feedback adaptive weighted dense network</article-title>
          ,
          <source>Artificial Intelligence in Medicine</source>
          <volume>106</volume>
          (
          <year>2020</year>
          )
          <fpage>101857</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Telea</surname>
          </string-name>
          ,
          <article-title>An image inpainting technique based on the fast marching method</article-title>
          ,
          <source>Journal of graphics tools 9</source>
          (
          <year>2004</year>
          )
          <fpage>23</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tschannen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gritsenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Naeem</surname>
          </string-name>
          , I. Alabdulmohsin,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parthasarathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          , et al.,
          <article-title>Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features</article-title>
          ,
          <source>arXiv preprint arXiv:2502.14786</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bagga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Preston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Valluri</surname>
          </string-name>
          , et al.,
          <article-title>Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs</article-title>
          ,
          <source>arXiv preprint arXiv:2303.00915</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Sam-guided enhanced fine-grained encoding with mixed semantic learning for medical image captioning</article-title>
          ,
          <source>in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>1731</fpage>
          -
          <lpage>1735</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>19730</fpage>
          -
          <lpage>19742</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <article-title>ContactDoctor, ContactDoctor-Bio-Medical: A High-Performance Biomedical Language Model</article-title>
          , https://huggingface.co/ContactDoctor/Bio-Medical-Llama-3-8B,
          <year>2024</year>
          . Accessed: 2025-06-16.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Myers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vijayanarasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Canny,</surname>
          </string-name>
          <article-title>Ic3: Image captioning by committee consensus</article-title>
          ,
          <source>arXiv preprint arXiv:2302.01328</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellam</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. Das</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          <string-name>
            <surname>Parikh</surname>
          </string-name>
          ,
          <article-title>Bleurt: Learning robust metrics for text generation</article-title>
          ,
          <source>arXiv preprint arXiv:2004.04696</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lamsiyah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El Mahdaouy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Espinasse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E. A.</given-names>
            <surname>Ouatik</surname>
          </string-name>
          ,
          <article-title>An unsupervised method for extractive multi-document summarization based on centroid approach and sentence embeddings</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>167</volume>
          (
          <year>2021</year>
          )
          <fpage>114152</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Biobert: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <year>2020</year>
          )
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E.</given-names>
            <surname>Alsentzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Boag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-H.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jindi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McDermott</surname>
          </string-name>
          ,
          <article-title>Publicly available clinical BERT embeddings</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Rumshisky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2nd Clinical Natural Language Processing Workshop</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota, USA,
          <year>2019</year>
          , pp.
          <fpage>72</fpage>
          -
          <lpage>78</lpage>
          . URL: https://aclanthology.org/W19-1909/. doi:10.18653/v1/W19-1909.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>A pubmedbert-based classifier with data augmentation strategy for detecting medication mentions in tweets</article-title>
          ,
          <source>arXiv preprint arXiv:2112.02998</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Biogpt: generative pre-trained transformer for biomedical text generation and mining</article-title>
          ,
          <source>Briefings in bioinformatics</source>
          <volume>23</volume>
          (
          <year>2022</year>
          )
          <fpage>bbac409</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Aleman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anadkat</surname>
          </string-name>
          , et al.,
          <article-title>Gpt-4 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          , et al.,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13971</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>BERTScore: Evaluating text generation with BERT</article-title>
          , in: ICLR,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Seyhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Sokucu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gunluoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. S.</given-names>
            <surname>Veske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Altin</surname>
          </string-name>
          ,
          <article-title>Primary pulmonary synovial sarcoma: a very rare presentation</article-title>
          ,
          <source>Case Reports in Pulmonology</source>
          <year>2014</year>
          (
          <year>2014</year>
          )
          <fpage>537618</fpage>
          -
          <lpage>537618</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>O. N.</given-names>
            <surname>Al Mulhim</surname>
          </string-name>
          ,
          <article-title>Huge thoracic aortic aneurysm presenting with jaundice: A case report</article-title>
          ,
          <source>Vascular Health and Risk Management</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEFmedical 2025 - medical concept detection and interpretable caption generation</article-title>
          , in:
          <source>CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Stanciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-G.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-D.</given-names>
            <surname>Ştefan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Eryilmaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Shan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fabre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouteux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF 2025: Multimedia retrieval in medical, social media and content recommendation applications</article-title>
          , in: Experimental IR Meets Multilinguality, Multimodality, and Interaction,
          <source>Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025)</source>
          , Springer Lecture Notes in Computer Science LNCS, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>E.</given-names>
            <surname>Hirsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dawidowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tal</surname>
          </string-name>
          ,
          <article-title>Medrat: Unpaired medical report generation via auxiliary tasks</article-title>
          , in:
          <source>European Conference on Computer Vision</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          , et al.,
          <article-title>The llama 3 herd of models</article-title>
          ,
          <source>arXiv preprint arXiv:2407.21783</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Lora: Low-rank adaptation of large language models</article-title>
          ,
          <source>ICLR</source>
          <volume>1</volume>
          (
          <year>2022</year>
          )
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>O.</given-names>
            <surname>Bodenreider</surname>
          </string-name>
          ,
          <article-title>The unified medical language system (umls): integrating biomedical terminology</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>32</volume>
          (
          <year>2004</year>
          )
          <fpage>D267</fpage>
          -
          <lpage>D270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>ROCOv2: Radiology objects in context version 2, an updated multimodal image dataset</article-title>
          ,
          <source>Scientific Data</source>
          <volume>11</volume>
          (
          <year>2024</year>
          ).
          doi:10.1038/s41597-024-03496-6.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santamaria-Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guyman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sangani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Medimageinsight: An open-source embedding model for general domain medical imaging</article-title>
          ,
          <source>arXiv preprint arXiv:2410.06542</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>R.</given-names>
            <surname>Sennrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <article-title>Improving neural machine translation models with monolingual data</article-title>
          ,
          <source>arXiv preprint arXiv:1511.06709</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>