<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Advancing Vision and Language in GI Diagnosis: Florence2 for Question Answering and Stable Diffusion for Image Synthesis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krishna Tewari</string-name>
          <email>krishnatewari.rs.cse24@itbhu.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sukomal Pal</string-name>
          <email>spal.cse@itbhu.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology (BHU) Varanasi</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recent advances in medical AI have underscored the importance of Visual Question Answering (VQA) and medical image generation. VQA systems enable automated reasoning over medical images using natural language queries, enhancing clinical interpretability. Meanwhile, generative models synthesize realistic medical images from textual descriptions, supporting data augmentation, simulation training, and rare case generation, which is particularly valuable in low-resource domains. Though evaluated independently in this challenge, these tasks are inherently complementary. VQA can aid in semantically annotating synthetic images, while synthetic images can enrich datasets to improve VQA model training. Together, they pave the way for robust multimodal diagnostic systems. In the ImageCLEFmed-MEDVQA-GI 2025 challenge, we address both subtasks: (1) closed-domain VQA and (2) medical image generation in the gastrointestinal (GI) domain. For VQA, we fine-tuned Microsoft's Florence2 vision-language transformer on the Kvasir-VQA dataset, using a custom preprocessing pipeline to remove specular highlights and black borders. Evaluation on the test sets yielded BLEU scores of 0.24/0.22, ROUGE-L scores of 0.87/0.88, and METEOR scores of 0.48/0.49, demonstrating strong domain-specific performance. For image generation, we fine-tuned Stable Diffusion with LoRA using synthetic captions produced by the QwQ language model. The model generated high-resolution (768×768) GI images. Evaluation on the private test set achieved a fidelity score of 0.2739, agreement of 0.739, and diversity of 0.6481, indicating high-quality synthesis. Our approach integrates fine-tuned VQA and diffusion models in a reproducible multimodal framework, advancing clinical image interpretation and dataset enrichment in low-resource GI healthcare.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical VQA</kwd>
        <kwd>ImageCLEFmed 2025</kwd>
        <kwd>Multimodal AI</kwd>
        <kwd>Clinical Question Answering</kwd>
        <kwd>Synthetic GI Images</kwd>
        <kwd>Specular Highlight Removal</kwd>
        <kwd>Diffusion Models</kwd>
        <kwd>Florence2</kwd>
        <kwd>Low-Rank Adaptation (LoRA)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, gastrointestinal (GI) image analysis has emerged as a cornerstone of diagnostic medicine
due to the rising global burden of GI diseases. Colorectal cancer alone accounts for more than 1.9 million
new cases and over 935,000 deaths annually, making it the third most deadly cancer worldwide.
High-resolution endoscopic techniques such as colonoscopy and capsule endoscopy generate vast quantities
of visual data that require precise, real-time interpretation. Manual analysis is time-consuming and
susceptible to variability. This motivates the need for automated, intelligent image interpretation
systems. With the advancement of computational imaging, deep learning, and image-guided diagnostics,
AI-driven tools now aid in the detection of polyps, ulcers, and inflammatory markers with increasing
accuracy [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Additionally, the shift toward data-driven healthcare has accelerated the adoption of GI
image analysis platforms which aim to reduce diagnostic delays and enhance early disease detection [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Visual Question Answering (VQA) plays a critical role in medical AI by bridging the gap between
complex visual data and clinical decision-making through natural language interaction. In the medical
domain, VQA systems are trained to interpret an image I and answer a clinician’s question Q by
producing an answer A, formalized as f : (I, Q) → A. By leveraging state-of-the-art transformer
architectures, these systems encode joint embeddings z = f(I, Q) that capture intricate relationships
between image features and textual queries [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This facilitates interpretable and context-aware analysis
of subtle pathological findings, which may be overlooked or misinterpreted during manual review.
Consequently, VQA enhances diagnostic accuracy and efficiency, enabling clinicians to obtain rapid
and reliable answers that assist real-time decision-making.
      </p>
      <p>
        In contrast, synthetic image generation addresses a complementary yet equally vital challenge: the
scarcity and imbalance of annotated medical imaging datasets. Diffusion-based generative models learn
to approximate the conditional distribution p(x0 | c) of realistic images x0 given clinical captions c
by progressively denoising latent variables z over discrete time steps t [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This enables the creation
of diverse, high-fidelity GI images that represent rare pathologies or underrepresented anatomical
variations. It also circumvents the privacy issues inherent in using real data. These synthetic
datasets not only expand the training resources for AI models but also allow systematic evaluation
of diagnostic algorithms across a broader clinical spectrum. Importantly, when integrated with VQA
pipelines, synthetic generation supports the creation of well-annotated multimodal data, reinforcing
model robustness and clinical relevance [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Together, VQA and synthetic image generation serve distinct but synergistic functions within medical
AI: VQA enhances clinical interpretability and decision support by enabling interactive image
understanding, while synthetic generation expands and balances training data to improve model generalization
and reliability.</p>
      <p>
        To foster progress at this intersection of vision-language understanding and image synthesis, the
ImageCLEFmed-MEDVQA-GI 2025 challenge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] introduced a unique dual-task benchmark in the GI
domain. This year’s challenge focused on two core subtasks:
• Subtask 1 (Closed-domain VQA): Participants were required to develop models capable of
answering medical questions grounded in GI endoscopic images, using datasets annotated with
question-answer pairs [
        <xref ref-type="bibr" rid="ref7">7</xref>
]. This called for multimodal learning approaches that
jointly analyze visual data and textual queries in order to produce accurate responses.
• Subtask 2 (Synthetic Image Generation): This task involved generating synthetic GI images
conditioned on clinical text prompts, such as descriptions of polyp types, anatomical landmarks,
or procedural findings [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Both subtasks aim to address pressing challenges in the clinical AI landscape: Subtask 1 enhances
interpretability and decision support, while Subtask 2 enables scalable data augmentation and training
for low-resource tasks.</p>
      <p>
        In response to this challenge, we participated in both subtasks independently by employing
state-of-the-art models tailored to each task. For Subtask 1, we fine-tuned Microsoft’s Florence2, a robust
vision-language transformer, on the Kvasir-VQA dataset. To improve the quality of the visual input, we
applied a specialized preprocessing pipeline addressing common endoscopic image artifacts such as
specular highlights and black masks. Mixed-precision training was used to balance model performance
and computational efficiency [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. For Subtask 2, we fine-tuned Stable Diffusion v2-1 using Low-Rank
Adaptation (LoRA), which allows efficient adaptation of large diffusion models with minimal resource
demands [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This approach enabled the synthesis of high-resolution, clinically relevant GI images,
enhancing the diversity and realism of generated medical data.
      </p>
      <p>Through these efforts, we showcased the effectiveness of leveraging advanced vision-language and
generative models for distinct yet complementary challenges in GI imaging, contributing to improved
interpretability and data augmentation in clinical applications.</p>
      <p>The rest of the paper is structured as follows. Section 2 provides a concise overview of prior research
in this field. In Section 3, we provide the task overview and then describe the datasets used. Section
4 elaborates on our computational methodologies and model specifications. Next, we explain the
evaluation methodology, present our results and conduct a comprehensive analysis in Section 5. Finally,
we conclude in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Recent advances in medical artificial intelligence have increasingly emphasized the integration of VQA
and synthetic image generation to improve diagnostic accuracy, especially within the GI domain. This
integration involves complex multimodal reasoning, combining visual data from medical images with
natural language questions to generate clinically relevant answers.</p>
      <p>
        Pioneering datasets like VQA-RAD [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] established early benchmarks by pairing radiological images
with clinically meaningful questions and answers. This dataset enabled models to learn how to interpret
medical images while simultaneously reasoning about associated clinical queries, focusing on
radiology-specific pathologies. Building upon this, PathVQA [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] extended the framework to pathology by
compiling over 32,000 question-answer pairs based on microscopic histopathological images. This
allowed for deeper reasoning about cellular and tissue-level abnormalities, enhancing model capabilities
to understand fine-grained pathological features. In the specific context of GI endoscopy,
Kvasir-VQA [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] curated a large-scale dataset featuring endoscopic images paired with multiple question types,
including anatomical, pathological, and procedural questions.
      </p>
      <p>
        Simultaneously, the generation of synthetic medical images has become a critical technique for
overcoming data scarcity and imbalance, particularly in rare disease classes that lack sufficient
real-world examples. Generative Adversarial Networks (GANs) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] represent a landmark approach wherein
a generator network produces synthetic images, which are then judged by a discriminator network that
learns to distinguish real from fake images. While GANs have been effective in many domains, their
training often suffers from instability and mode collapse. This limits their capacity to produce diverse,
high-fidelity medical images reliably.
      </p>
      <p>
        To address these limitations, diffusion models [
        <xref ref-type="bibr" rid="ref14 ref4">4, 14</xref>
        ] have emerged as a powerful alternative. These
models learn a reverse denoising process that transforms random noise into realistic images through a
series of iterative refinement steps. Recent studies in medical imaging [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] demonstrate that diffusion
models generate synthetic images with greater anatomical coherence and visual diversity than GANs,
making them particularly suitable for augmenting GI imaging datasets.
      </p>
      <p>
        On the front of vision-language integration, models like CLIP [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] learn joint embeddings of images
and text by training on large-scale web data, enabling zero-shot image recognition and open-domain
image-text reasoning. Similarly, BLIP (Bootstrapping Language-Image Pre-training) improves captioning
and VQA by aligning vision and language features through a combination of contrastive and generative
objectives. The Flamingo [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] model further extends this by enabling few-shot learning and open-ended
multimodal reasoning using a large-scale transformer architecture.
      </p>
      <p>In medical AI, domain-specific adaptations such as Med-Flamingo [18] have been proposed.
Med-Flamingo fine-tunes large pretrained vision-language models on medical image-text pairs to generate
clinical rationales. It also performs VQA with limited labeled data, capturing domain-specific semantic
nuances. Despite these advances, these models often require extensive fine-tuning, which can be
computationally prohibitive, and they may suffer from domain gaps because the base models are
primarily trained on natural images and general language corpora.</p>
      <p>
        To mitigate these challenges, parameter-efficient fine-tuning (PEFT) techniques such as LoRA [19]
enable the adaptation of large pretrained models by training only a small number of additional low-rank
parameter matrices. This significantly reduces the number of trainable parameters and computational
resources while maintaining performance. PEFT approaches have shown promising results in medical
imaging tasks [
        <xref ref-type="bibr" rid="ref15">20, 15</xref>
        ], allowing resource-efficient transfer learning, but they remain underexplored in complex
multimodal GI VQA systems.
      </p>
      <p>The recent ImageCLEFmed-MEDVQA challenges in 2023 [21] and 2024 [22] have driven
state-of-the-art progress by encouraging innovative approaches to multimodal learning on GI endoscopic images and
text queries. The 2023 winning team, UIT-Saviors [23], enhanced model performance by applying image
enhancement techniques such as contrast adjustment and noise reduction to improve the visibility of
fine endoscopic features before multimodal fusion with text. This preprocessing step enabled more
precise visual feature extraction crucial for answering diagnostic questions. In 2024, the top solutions
incorporated fine-tuned diffusion models for synthetic image augmentation alongside PEFT strategies
for adapting large multimodal transformers.</p>
      <p>Collectively, these research efforts exemplify a converging paradigm of leveraging advanced
multimodal architectures, computationally efficient fine-tuning, and high-fidelity synthetic data generation.</p>
      <p>Building upon these foundations, our work addresses two separate subtasks: VQA and
high-fidelity synthetic image generation within the GI tract. For VQA, we utilize the Florence2 generative
vision-language model alongside a robust preprocessing pipeline tailored to endoscopic images. For
image generation, we fine-tune diffusion models with LoRA for efficient adaptation. This dual-subtask
framework enriches interpretability and data diversity, advancing the state-of-the-art in multimodal
learning for GI healthcare applications.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Overview and Dataset</title>
      <p>This section outlines the overall objectives of the task and presents a detailed description of the dataset
used to perform our experimental analysis.</p>
      <sec id="sec-3-1">
        <title>3.1. Task Overview</title>
        <p>We participated in both Subtask 1 and Subtask 2 of the ImageCLEFmed-MEDVQA-GI 2025 challenge [24],
which comprises the following components:
• VQA: This subtask requires generating textual answers based on image-question pairs.
Participants must develop systems capable of interpreting endoscopic images in conjunction with
corresponding clinical questions to produce accurate textual responses. For example, given an
image showing a colon polyp and the question, “Where in the image is the polyp located?”, the
system should output a relevant answer such as “upper-left” or “in the center of the image.”
• Image Generation: This subtask involves building models that transform clinical text
descriptions into high-fidelity synthetic GI images. The objective is to generate synthetic outputs
that closely resemble real endoscopic images, such as those obtained through colonoscopy or
gastroscopy, while preserving anatomical accuracy and clinical variability.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset Information</title>
        <p>
          The dataset utilized in this challenge is Kvasir-VQA [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], a publicly available, expert-annotated dataset
designed to advance research in medical VQA and related tasks within the GI diagnostic domain. The
dataset is an extension of the HyperKvasir [25] and Kvasir-Instrument [26] datasets, enriched with
natural language question-and-answer (QA) annotations to support multimodal learning tasks.
        </p>
        <p>Kvasir-VQA consists of 6,500 high-resolution endoscopic images, each annotated with multiple
clinically relevant questions spanning various anatomical regions and pathological findings of the GI
tract. The dataset includes images of conditions such as polyps, ulcers, esophagitis, and scenes with
surgical instruments, ensuring rich diversity for robust model training.</p>
        <p>Each image is paired with multiple QA instances, resulting in over 38,500 VQA pairs. These questions
cover:
• Yes/No queries (e.g., “Is there a polyp?”),
• Categorical choices (e.g., “What type of lesion is shown?”),
• Spatial location (e.g., “Where is the abnormality?”),
• Numerical counts (e.g., “How many instruments are visible?”).</p>
        <p>These QA pairs are grounded in real clinical reasoning, making the dataset well-suited for medical AI
applications. The answer formats include binary responses, categorical labels, ordinal values, and spatial
descriptors, simulating real-world diagnostic ambiguity and diversity. Figure 1 shows some sample
images along with QA pairs.</p>
        <p>Question: Where in the image is the instrument? Answer: center; center-left;
lower-center; lower-left. Question: Are there any abnormalities in the image? Check all
that are present. Answer: ulcerative colitis. Question: How many findings are present?
Answer: 1</p>
        <p>For the image generation subtask, the Kvasir-VQA dataset is utilized, incorporating the same endoscopic
images along with synthetically generated clinical captions2. Figure 2 shows a sample image and Table
1 shows sample synthetically generated captions associated with it. These captions simulate realistic
diagnostic language and are designed to support conditional image synthesis tasks in the medical
domain.
2https://raw.githubusercontent.com/simula/ImageCLEFmed-MEDVQA-GI-2025/refs/heads/main/kvasir-captions.json</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This section presents our end-to-end methodology for solving the two core tasks: VQA and Image
Generation. Inspired by recent advances in multimodal deep learning and guided by domain-specific
requirements of GI imaging, we propose a system that combines vision-language fusion and generative
modeling, supported by extensive preprocessing and PEFT.</p>
      <sec id="sec-4-1">
        <title>4.1. VQA Pipeline (Subtask 1)</title>
        <p>Our approach to the VQA task builds upon the Florence-2-base-ft model, a vision-language transformer
designed for causal language modeling with integrated image and text understanding as shown in
Figure 3. Each input pair consists of a clinically relevant natural language question and an associated
colonoscopy or gastroscopy image. Prior to ingestion into the model, we perform crucial image
preprocessing to address challenges inherent to endoscopic imaging.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Preprocessing</title>
          <p>Colonoscopy images are often affected by optical and environmental artifacts that hinder automated
image analysis. Two of the most common and disruptive issues include specular highlights, which
are high-intensity reflections caused by wet mucosal surfaces illuminated by the endoscope’s light
source, and black masks, which are peripheral dark regions caused by circular lens vignetting or camera
constraints. Both artifacts distort texture and shape cues vital for downstream tasks. Therefore, we
developed a dedicated image enhancement module that performs targeted correction of these issues.
Specular Highlight Removal: Specular highlights are regions of over-saturation where the reflected
light exceeds the sensor’s dynamic range. These areas lack semantic information and often introduce
noise into feature maps. We define the grayscale intensity of a given RGB image I(x, y) ∈ R³ as:
gray(x, y) = 0.299 · R(x, y) + 0.587 · G(x, y) + 0.114 · B(x, y)
(1)</p>
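          <p>As an illustration, the luminance conversion and thresholding of Eqs. 1 and 2 can be sketched in a few lines of NumPy (the function name and array layout here are our own; this is a sketch, not the authors’ released code):</p>
          <p>
```python
import numpy as np

def specular_mask(rgb, tau=240.0):
    """Binary mask M of over-saturated pixels (Eqs. 1 and 2).

    rgb: H x W x 3 float array with channels in R, G, B order.
    Returns a uint8 mask: 1 where gray(x, y) is at least tau, else 0.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    gray = 0.299 * r + 0.587 * g + 0.114 * b   # luma weights from Eq. 1
    return (gray >= tau).astype(np.uint8)       # threshold at tau = 240 (Eq. 2)
```
</p>
          <p>In the full pipeline the mask would next be dilated with a 5 × 5 kernel and the flagged regions filled by Telea inpainting, e.g. via OpenCV’s cv2.inpaint with the cv2.INPAINT_TELEA flag.</p>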
          <p>A binary mask M is computed by thresholding pixels above an empirically chosen intensity value
τ = 240:
M(x, y) = 1 if gray(x, y) ≥ τ, and 0 otherwise
(2)</p>
          <p>This mask is then dilated using a morphological operator to expand bright regions and fill small gaps:
M_d = M ⊕ K
(3)
where K is a 5 × 5 square structuring element and ⊕ denotes dilation. We extract connected
components and discard those with area A &lt; 100 pixels to eliminate noise. The resulting regions
are filled using Telea’s inpainting algorithm [27], which solves the following Laplace equation with
Dirichlet boundary conditions:
div(∇I*(x, y)) = 0, ∀(x, y) ∈ Ω
(4)
where Ω is the inpainting domain and I*(x, y) is the reconstructed intensity map.</p>
          <p>Black Mask Removal: Black masks occur due to the mismatch between the circular field of view
and the rectangular image sensor. To detect these regions, we apply a low-threshold operation on gray:
B(x, y) = 1 if gray(x, y) ≤ τ_b, and 0 otherwise
(5)
where τ_b = 5 is the lower bound for black region detection. The bounding box of the foreground
(non-black) region is computed as (x_min, x_max, y_min, y_max), and a padding margin p = 5 is added to
avoid overcropping:
x′_min = max(0, x_min − p), x′_max = min(W, x_max + p)
(6)
y′_min = max(0, y_min − p), y′_max = min(H, y_max + p)
(7)</p>
          <p>The resulting black regions are also reconstructed using Telea’s method, ensuring a smooth transition
between content and background.</p>
          <p>These preprocessing methods significantly improved downstream VQA by normalizing image
structure and reducing noise, consistent with findings in prior work [23, 28]. The threshold values τ and τ_b
were selected through visual inspection over a random subset of development images to ensure maximal
removal of unwanted artifacts while preserving diagnostically relevant features. Further automated
tuning could be explored in future iterations. While additional enhancements such as contrast-limited
adaptive histogram equalization (CLAHE), denoising, and sharpening were implemented, they were
excluded from the final training phase to avoid overfitting to synthetic enhancements.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Model Architecture</title>
          <p>Following preprocessing, images and tokenized text inputs are passed into the Florence-2 model3.
The image is processed through a transformer-based vision encoder, which extracts high-dimensional
spatial features. Simultaneously, the question is encoded through a causal text encoder adapted for
token-level language modeling. The two streams are fused via cross-attention layers within the model’s
architecture, allowing text tokens to dynamically attend to semantically relevant image regions. This
fusion mechanism enables the model to align language components such as anatomical references,
colors, or object types with visual counterparts in the image. Unlike traditional classification-based
MedVQA systems, our model produces free-form text responses in a generative fashion. This is
particularly advantageous for variable-length answers, such as anatomical descriptions or answers
involving compound attributes (e.g., “Red and pink in the lower-left quadrant”).
3https://huggingface.co/microsoft/Florence-2-base</p>
          <p>Training of the VQA model was performed using Hugging Face’s Trainer class. The model was
fine-tuned over 3 epochs with a learning rate of 3e-5, using the AdamW optimizer with a weight
decay factor of 0.01. Due to memory constraints and model size, we set the per-device batch size
to 3 and used gradient accumulation over 16 steps. Mixed precision training (fp16) was enabled to
optimize GPU memory usage. All training experiments were conducted on one NVIDIA H100 GPU
(80GB VRAM), ensuring the model had sufficient computational capacity for large-scale multimodal
learning. The final model was saved and uploaded to the Hugging Face Model Hub under the repository
krissTewari/Florence-2-vqa-final4.</p>
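          <p>The reported settings imply an effective batch size of 48 per optimizer step. A minimal sketch of the configuration (the dictionary keys mirror Hugging Face TrainingArguments parameter names; the exact training script is not shown in this paper):</p>
          <p>
```python
# Hyperparameters reported for Florence-2 fine-tuning (Section 4.1.2).
# In practice these would be passed to transformers.TrainingArguments.
vqa_config = {
    "num_train_epochs": 3,
    "learning_rate": 3e-5,
    "weight_decay": 0.01,                 # AdamW weight decay
    "per_device_train_batch_size": 3,
    "gradient_accumulation_steps": 16,
    "fp16": True,                         # mixed-precision training
}

# Gradient accumulation makes each optimizer step see 3 x 16 samples:
effective_batch = (vqa_config["per_device_train_batch_size"]
                   * vqa_config["gradient_accumulation_steps"])
print(effective_batch)  # 48
```
</p>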
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Image Generation Pipeline (Subtask 2)</title>
        <p>For Subtask 2, we developed a dedicated image generation pipeline capable of synthesizing realistic
endoscopic images from structured diagnostic captions as shown in Figure 4.</p>
        <p>We used Stable Diffusion v2-1 5 as the base generative model, a state-of-the-art latent diffusion model
pretrained on billions of text-image pairs. Unlike traditional pixel-space generators, Stable Diffusion
operates within a latent space learned by a variational autoencoder (VAE), which enables high-resolution
generation with reduced computational overhead. The model was configured to generate images at a
resolution of 768×768 pixels, which balances visual fidelity and computational feasibility in the medical
domain, especially where fine anatomical structures must be preserved.</p>
        <p>The dataset used for fine-tuning consisted of caption-image pairs from the development dataset of
the ImageCLEFmed-MEDVQA-GI task. These captions are natural language descriptions of clinically
relevant image content, including the presence, size, shape, color, and location of abnormalities (e.g.,
polyps), as well as the presence of instruments or anatomical landmarks.</p>
        <p>To assess the best-suited backbone for domain-specific image synthesis, we initially evaluated three
model variants: Stable Diffusion v1-5 6, Stable Diffusion XL (SDXL) 7, and Stable Diffusion v2-1. While
SDXL demonstrated promising compositional quality due to its expanded architecture and 1024×1024
latent space, it incurred significantly higher VRAM requirements and longer training times, making
it impractical for constrained medical datasets. Stable Diffusion v1-5, while computationally efficient,
showed limitations in accurately capturing fine-grained medical textures and anatomical features. In
contrast, Stable Diffusion v2-1 offered a strong trade-off between resolution (768×768), performance,
and memory usage, and thus was selected for the final training and submission phase.</p>
        <p>In order to adapt the pretrained Stable Diffusion model to this highly specialized domain while
avoiding the need to update the full set of model parameters, we employed LoRA. During training, only
the inserted matrices are updated, while the base model weights remain frozen. In our implementation,
4https://huggingface.co/krissTewari/Florence-2-vqa-final
5https://huggingface.co/stabilityai/stable-diffusion-2-1
6https://huggingface.co/runwayml/stable-diffusion-v1-5
7https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
LoRA was configured with a rank of 4. This choice was guided by prior work and common practice in
parameter-efficient fine-tuning (PEFT) for diffusion models. We also conducted preliminary experiments
with LoRA ranks of 2, 4, and 8. Rank 4 provided the best trade-off between parameter efficiency and
generation quality without exceeding memory constraints. Higher ranks showed minimal improvement
in fidelity but significantly increased VRAM usage. We applied LoRA modules specifically to the
cross-attention layers within the UNet, which allows the model to effectively learn how to condition the
image synthesis on medical language prompts without catastrophic forgetting of general-domain image
priors. This method significantly reduced VRAM usage and made training feasible on commercially
available hardware without degrading generative quality.</p>
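        <p>The parameter savings of a rank-4 update can be made concrete with a small NumPy sketch (the layer size here is illustrative, not the actual UNet cross-attention dimensions):</p>
        <p>
```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 768, 768, 4, 4.0   # illustrative projection size

W = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(size=(rank, d_in))    # trainable low-rank down-projection
B = np.zeros((d_out, rank))          # trainable up-projection, zero-initialized

# LoRA-adapted weight: W' = W + (alpha / rank) * B @ A, with W never updated.
W_adapted = W + (alpha / rank) * (B @ A)

trainable = A.size + B.size          # 6,144 trainable parameters
frozen = W.size                      # 589,824 frozen parameters
print(trainable / frozen)            # roughly 1% of the layer's weights

# Zero-initializing B makes the adapted layer start exactly at the base weights.
assert np.allclose(W_adapted, W)
```
</p>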
        <p>The training process was orchestrated using Hugging Face’s accelerate launcher in conjunction with
the diffusers and peft libraries. Training was conducted over 3 full epochs, with a batch size of 4 and
gradient accumulation over 4 steps. We used a constant learning rate of 1e-4, which was selected based
on prior literature and empirical performance on LoRA tuning. The training loop included automatic
checkpointing every 500 steps, with a maximum of three recent checkpoints retained to support
resumption and rollback. To monitor the qualitative progression of generation, we used validation
prompts during training. These prompts were constructed to mirror common VQA queries, one of
which is: “The colonoscopy image contains a single, moderate-sized polyp that has not been removed,
appearing in red and pink tones in the center and lower areas.” This prompt captures both anatomical
features and visual attributes and serves as a reliable baseline for visual inspection during training.</p>
        <p>All training was carried out on a server equipped with four NVIDIA L40s GPUs, each offering 48
GB of VRAM. The model was trained using mixed-precision (fp16) to reduce memory consumption
and improve throughput. The use of gradient checkpointing within the UNet further minimized the
peak memory footprint. Upon completion of training, the final LoRA adapter weights were pushed to
the Hugging Face Model Hub under the repository krissTewari/sd-kvasir-imagen-demo8, making them
publicly available for reproducibility and further research use.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In this section, we present a detailed evaluation of our proposed models on VQA and Image Generation
tasks. The assessment is conducted using standard evaluation metrics across both public and private
test sets to provide a comprehensive understanding of model performance.</p>
      <sec id="sec-5-1">
        <title>5.1. Evaluation Metrics</title>
        <p>The performance of our models was assessed using a comprehensive set of evaluation metrics tailored
to the distinct characteristics of VQA and Image Generation tasks. These metrics were selected to
provide a multidimensional understanding of linguistic quality, semantic alignment, and visual fidelity.</p>
        <sec id="sec-5-1-1">
          <title>5.1.1. Metrics for VQA</title>
          <p>• BLEU (Bilingual Evaluation Understudy): BLEU [29] measures the modified n-gram precision
of generated answers with respect to reference answers, incorporating a brevity penalty (BP) to
discourage overly short outputs:</p>
          <p>BLEU-N = BP · exp( Σ_{n=1}^{N} w_n log p_n ) (8)</p>
          <p>BP = 1 if c &gt; r, and BP = e^{(1 − r/c)} if c ≤ r (9)</p>
          <p>where c is the length of the candidate, r is the reference length, p_n is the modified n-gram
precision, and w_n are the n-gram weights (typically uniform).
• ROUGE (Recall-Oriented Understudy for Gisting Evaluation): ROUGE [30] is a family
of metrics that evaluate n-gram and sequence overlaps. We employ ROUGE-1, ROUGE-2, and
ROUGE-L, where:</p>
          <p>ROUGE-N = Σ_{gram ∈ Ref} min(Count_match(gram), Count_cand(gram)) / Σ_{gram ∈ Ref} Count_Ref(gram) (10)</p>
          <p>and ROUGE-L measures the Longest Common Subsequence (LCS) recall.
• METEOR (Metric for Evaluation of Translation with Explicit ORdering): METEOR [31]
combines unigram precision and recall with stemming, synonym matching, and a fragmentation penalty:</p>
          <p>METEOR = F_mean · (1 − Pen) (11)</p>
          <p>where F_mean is the harmonic mean of unigram precision and recall, and Pen penalizes fragmented
matches.</p>
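As a concrete illustration of Eqs. (8)–(9), the following minimal Python sketch computes sentence-level BLEU with clipped n-gram precision and the brevity penalty. It is an illustrative re-implementation for readers, not the official challenge evaluation script, and the function names are our own:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU-N with uniform weights and brevity penalty (Eqs. 8-9)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # modified (clipped) n-gram precision p_n
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:  # a zero precision makes log undefined; BLEU is 0
        return 0.0
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty (Eq. 9)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

In practice we report the corpus-level scores produced by the organizers' scripts; this sketch only makes the per-sentence arithmetic explicit.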
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Metrics for Image Generation</title>
          <p>In addition to expert ratings, four automated quantitative metrics are used to assess the realism,
consistency, and variability of generated images. These metrics leverage BiomedCLIP [? ] image
embeddings, providing a semantically grounded evaluation tailored to the medical imaging domain.
• Fidelity (↑): Measures how realistic the generated images are compared to real colonoscopy
images. It is computed as a scaled inverse of the Fréchet Inception Distance (FID), using BiomedCLIP
embeddings:</p>
          <p>Fidelity = 1000 / (1 + mean-FID(G_p, R_p)) (12)</p>
          <p>where G_p and R_p denote the generated and real image features for prompt p. Higher scores
indicate greater similarity to real images.
• Agreement (↑): Evaluates semantic and visual consistency between images generated from
original and rephrased prompts. Agreement is calculated as the average cosine similarity between
BiomedCLIP embeddings of images from paired prompts:</p>
          <p>Agreement = (1/N) Σ_{i=1}^{N} (1 / (|O_i| |R_i|)) Σ_{o ∈ O_i, r ∈ R_i} (o · r) / (‖o‖ ‖r‖) (13)</p>
          <p>where O_i and R_i represent the image embedding sets from the original and rephrased prompts.
• Diversity (↑): Quantifies intra-prompt variability among generated images, promoting diverse
outputs rather than mode collapse. For each prompt, we compute the average pairwise Euclidean
distance between the BiomedCLIP embeddings of the 10 generated images:</p>
          <p>Diversity = (1/N) Σ_{p=1}^{N} pdist(E_p) (14)</p>
          <p>where E_p is the set of normalized embeddings per prompt and pdist denotes the mean pairwise
distance function.
• Fréchet BiomedCLIP Distance (FBD) (↓): Assesses global distributional similarity between the
entire set of generated and real images using the Fréchet distance, computed in the BiomedCLIP
embedding space:</p>
          <p>FBD = ‖μ_gen − μ_real‖² + Tr( Σ_gen + Σ_real − 2 (Σ_gen Σ_real)^{1/2} ) (15)</p>
          <p>where μ and Σ are the mean and covariance of embeddings from generated and real images, and
Tr denotes the matrix trace. Lower scores indicate better global alignment.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results</title>
        <p>This section presents the evaluation outcomes and detailed analysis of our approaches.</p>
        <p>5.2.1. VQA</p>
        <p>As shown, the model achieves strong and consistent results across both datasets. On the public
test set, it attains a BLEU score of 0.24, ROUGE-1 and ROUGE-L scores of 0.87, and ROUGE-2 and
METEOR scores of 0.11 and 0.48, respectively. Performance on the private test set is comparably robust,
with slight variations: BLEU at 0.22 and marginal improvements in ROUGE-1, ROUGE-L,
and METEOR scores. These results confirm the model’s reliable generalization and effectiveness in
answering domain-specific medical questions in GI imagery.</p>
        <p>Table 3 reports our ablation study, i.e., the performance of the Florence2 model under different
configurations involving LoRA fine-tuning and image enhancement.</p>
        <p>When no LoRA fine-tuning or image enhancement is applied, the model achieves a BLEU score of
0.20, METEOR of 0.24, and ROUGE-L and ROUGE-1 scores near 0.45 and 0.46, respectively. Notably,
applying LoRA fine-tuning without prior image enhancement led to a significant drop in performance.
This suggests that the Florence2 model, which is pre-trained on clean natural images, may overfit to
visual noise when exposed to unprocessed endoscopic artifacts. The model might mistakenly associate
specular highlights or dark borders with specific answers, introducing spurious correlations. Therefore,
preprocessing appears essential before domain-specific fine-tuning, serving as a form of regularization.</p>
        <p>Applying image enhancement without LoRA fine-tuning improves results moderately, with BLEU
rising to 0.15 and ROUGE-L to 0.28, demonstrating the positive effect of improved visual inputs on
answer generation quality. The best results occur when LoRA fine-tuning is combined with image
enhancement, achieving a BLEU score of 0.24, METEOR of 0.48, and ROUGE-L and ROUGE-1 scores of
0.87 each, illustrating strong synergy between model adaptation and high-quality input preprocessing.</p>
        <p>In the ablation study, “With Image Enh.” refers exclusively to the application of specular highlight
and black mask removal. Other image enhancement techniques like CLAHE, sharpening, and denoising
were initially explored but later excluded from the final configuration. This decision was based on
qualitative assessments and pilot experiments, which showed negligible performance gains and a risk
of overfitting to artificially enhanced features.</p>
        <p>These findings emphasize that image enhancement consistently benefits VQA performance, while
LoRA fine-tuning is most effective when integrated with such preprocessing, ultimately enhancing the
domain-specific interpretability and accuracy of the Florence2 model on GI medical images.</p>
        <sec id="sec-5-2-1">
          <title>5.2.2. Image Generation</title>
          <p>Table 4 summarizes the performance of our fine-tuned Stable Diffusion v2-1 across the public and
private test sets.</p>
          <p>The fidelity scores of approximately 0.27 on both datasets indicate that the generated images closely
resemble real GI images in terms of visual quality and clinical relevance. Agreement scores near 0.74
demonstrate strong alignment between the generated images and the corresponding clinical captions,
validating the model’s ability to synthesize medically coherent images. Diversity scores above 0.64
reflect a healthy variety in the generated samples, important for capturing the range of possible clinical
presentations. The FBD values, which measure the distributional similarity between generated and real
image features, are lower on the private set (1694.97) than on the public set (1923.16), suggesting
better realism and fidelity in the private evaluation. Overall, these results highlight the effectiveness of
our LoRA-fine-tuned Stable Diffusion approach in producing high-resolution, clinically meaningful GI
images from textual descriptions.</p>
          <p>To evaluate the influence of different base models on image generation performance, we conducted
an ablation study comparing Stable Diffusion v1-5, SDXL, and Stable Diffusion v2-1 (our main model)
on the public test set. The results in Table 5 demonstrate that Stable Diffusion v2-1 outperforms the
other variants across critical metrics.</p>
          <p>Stable Diffusion v2-1 achieves the highest agreement score (0.74), indicating superior alignment with
clinical captions, while maintaining competitive fidelity (0.27) and diversity (0.66). SDXL demonstrates
slightly better fidelity (0.28) but lower agreement and diversity, suggesting it produces high-quality
images that are somewhat less relevant or varied. Stable Diffusion v1-5 shows higher diversity (0.74)
but comparatively lower fidelity and agreement. The Fréchet BiomedCLIP Distance (FBD) values
indicate that v2-1 and v1-5 generate more realistic images than SDXL. Overall, these results support the
conclusion that Stable Diffusion v2-1 is the best-balanced model for GI medical image generation.
Visual Samples: Figure 5 showcases example outputs generated by our top-performing image
generation methodology, illustrating the quality and clinical relevance of the synthetic GI images.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>This paper presents two distinct contributions: a VQA system and a synthetic image generation model,
each addressing separate subtasks within the GI medical imaging domain.</p>
      <p>For the VQA subtask, we fine-tuned the Florence2 vision-language transformer on the Kvasir-VQA
dataset. The model was trained to generate accurate textual answers to clinical questions posed over
endoscopic images. To improve input quality, we introduced preprocessing techniques to remove
specular highlights and black borders, common artifacts in endoscopic videos. This enhanced the
model’s ability to align visual and textual features, resulting in strong performance across standard
language generation metrics including BLEU, ROUGE-L, and METEOR.</p>
      <p>For the image generation subtask, we fine-tuned Stable Diffusion v2-1 using LoRA, enabling the
generation of high-resolution GI images from structured clinical captions. The model demonstrated
a strong ability to produce visually diverse and semantically accurate outputs, with favorable scores
across fidelity, agreement, and diversity metrics. This highlights the potential of diffusion models for
synthetic data generation in low-resource medical settings.</p>
      <p>Future Work. Although the two tasks are independent, their outcomes suggest several promising
avenues for the research community, some of which are outlined below:
• Multilingual VQA systems: Expanding VQA models to handle clinical questions in multiple
languages could increase accessibility and support global deployment of AI-assisted diagnostics.
• Federated learning for privacy-preserving model training: Training vision-language and
generative models across distributed medical centers without transferring raw patient data would
enable broader collaboration while adhering to privacy regulations.
• Cross-task synergy: Although VQA and image generation were addressed as separate subtasks,
their interdependence presents a promising direction. For example, synthetic images generated
from clinical captions could be validated or further annotated by VQA models, enabling the
construction of weakly supervised multimodal datasets. Conversely, VQA outputs may guide
or condition future image generation pipelines. Exploring such interactions could lead to more
robust and context-aware medical AI systems.</p>
      <p>Continued research in these directions can contribute to the development of scalable, trustworthy,
and generalizable AI tools for GI diagnostics and other domains of medical imaging.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar and
spelling checks and for paraphrasing and rewording. After using these tools, the authors reviewed and
edited the content as needed and take full responsibility for the publication’s content.
[18] M. Moor, Q. Huang, S. Wu, M. Yasunaga, Y. Dalmia, J. Leskovec, C. Zakka, E. P. Reis, P. Rajpurkar,
Med-flamingo: A multimodal medical few-shot learner, in: Proceedings of the 3rd Machine
Learning for Health Symposium, PMLR, 2023, pp. 353–367. URL: https://proceedings.mlr.press/
v225/moor23a.html.
[19] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Chen, W. Chen, Lora: Low-rank
adaptation of large language models, in: Proceedings of the 38th International Conference on
Machine Learning (ICML), PMLR, 2021, pp. 11113–11125.
[20] R. Dutt, L. Ericsson, P. Sanchez, S. A. Tsaftaris, T. Hospedales, Parameter-efficient fine-tuning for
medical image analysis: The missed opportunity, in: N. Burgos, C. Petitjean, M. Vakalopoulou,
S. Christodoulidis, P. Coupe, H. Delingette, C. Lartizien, D. Mateus (Eds.), Proceedings of The
7th International Conference on Medical Imaging with Deep Learning (MIDL), volume 250 of
Proceedings of Machine Learning Research, PMLR, 2024, pp. 406–425. URL: https://proceedings.mlr.
press/v250/dutt24a.html.
[21] S. A. Hicks, A. Storås, P. Halvorsen, T. de Lange, M. A. Riegler, V. Thambawita, Overview of
imageclefmedical 2023 – medical visual question answering for gastrointestinal tract, in: CLEF2023
Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece, 2023.
[22] S. A. Hicks, A. Storås, P. Halvorsen, M. A. Riegler, V. Thambawita, Overview of imageclefmedical
2024 – medical visual question answering for gastrointestinal tract, in: CLEF2024 Working Notes,
CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France, 2024.
[23] T. M. Thai, A. T. Vo, H. K. Tieu, L. N. P. Bui, T. T. B. Nguyen, Uit-saviors at medvqa-gi 2023:
Improving multimodal learning with image enhancement for gastrointestinal visual question
answering, in: Proceedings of the MEDVQA-GI Workshop 2023, 2023. URL: https://arxiv.org/abs/
2307.02783, arXiv preprint arXiv:2307.02783.
[24] S. Gautam, P. Halvorsen, M. A. Riegler, V. Thambawita, S. A. Hicks, Overview of imageclefmedical
2025 – medical visual question answering for gastrointestinal tract, in: CLEF2025 Working Notes,
CEUR Workshop Proceedings, CEUR-WS.org, Madrid, Spain, 2025.
[25] H. Borgli, V. Thambawita, P. H. Smedsrud, S. Hicks, D. Jha, S. L. Eskeland, K. R. Randel, K. Pogorelov,
M. Lux, D. T. D. Nguyen, D. Johansen, H. K. Johansen, M. A. Riegler, P. Halvorsen, HyperKvasir, a
comprehensive multi-class image and video dataset for gastrointestinal endoscopy, Scientific Data
7 (2020) 1–14. doi:10.1038/s41597-020-00622-y.
[26] D. Jha, S. Ali, K. Emanuelsen, S. A. Hicks, V. Thambawita, E. Garcia-Ceja, M. A. Riegler, T. de Lange,
P. T. Schmidt, D. Johansen, Kvasir-instrument: Diagnostic and therapeutic tool segmentation
dataset in gastrointestinal endoscopy, in: MultiMedia Modeling, Springer, 2021, pp. 218–229. URL:
https://datasets.simula.no/kvasir-instrument/. doi:10.1007/978-3-030-67835-7\_18.
[27] A. Telea, An image inpainting technique based on the fast marching method, Journal of graphics
tools 9 (2004) 23–34.
[28] Y. Kumar, B. Verma, R. Srivastava, Gastrointestinal abnormality detection in wireless capsule
endoscopy images using deep learning, Computerized Medical Imaging and Graphics 79 (2020)
101678.
[29] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine
translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational
Linguistics (ACL), 2002, pp. 311–318.
[30] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text Summarization</p>
      <p>Branches Out: Proceedings of the ACL-04 Workshop, 2004, pp. 74–81.
[31] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation
with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation
Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Berbís</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aneiros-Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J. Mendoza</given-names>
            <surname>Olivares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Nava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Luna</surname>
          </string-name>
          ,
          <article-title>Role of artificial intelligence in multidisciplinary imaging diagnosis of gastrointestinal diseases</article-title>
          ,
          <source>World Journal of Gastroenterology</source>
          <volume>27</volume>
          (
          <year>2021</year>
          )
          <fpage>4395</fpage>
          -
          <lpage>4412</lpage>
          . URL: https://doi.org/10.3748/wjg.v27.
          <year>i27</year>
          .4395. doi:
          <volume>10</volume>
          . 3748/wjg.v27.
          <year>i27</year>
          .
          <fpage>4395</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Barua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vennapusa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Al-Jibury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Muthusamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hayee</surname>
          </string-name>
          ,
          <article-title>Effectiveness of artificial intelligence-assisted colonoscopy in early diagnosis of colorectal cancer: A systematic review</article-title>
          , eClinicalMedicine
          <volume>58</volume>
          (
          <year>2023</year>
          )
          <article-title>101875</article-title>
          . URL: https://doi.org/10.1016/j.eclinm.
          <year>2023</year>
          .
          <volume>101875</volume>
          . doi:
          <volume>10</volume>
          .1016/j.eclinm.
          <year>2023</year>
          .
          <volume>101875</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <article-title>Medvqa: Advances in medical visual question answering</article-title>
          ,
          <source>IEEE Transactions on Medical Imaging</source>
          <volume>42</volume>
          (
          <year>2023</year>
          )
          <fpage>1234</fpage>
          -
          <lpage>1248</lpage>
          . doi:
          <volume>10</volume>
          .1109/TMI.
          <year>2023</year>
          .
          <volume>3245678</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>Denoising diffusion probabilistic models</article-title>
          ,
          <source>in: NeurIPS</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>I. Ejiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <article-title>Synthetic medical image generation and visual question answering: A multimodal pipeline for clinical ai</article-title>
          ,
          <source>Computers in Science and Engineering</source>
          <volume>36</volume>
          (
          <year>2024</year>
          )
          <fpage>56</fpage>
          -
          <lpage>72</lpage>
          . doi:
          <volume>10</volume>
          .1109/CompSciEng.
          <year>2024</year>
          .
          <volume>1023456</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Stanciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Andrei</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Storås</surname>
            ,
            <given-names>A. B.</given-names>
          </string-name>
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Bracke</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Lecouteux</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Macaire</surname>
            ,
            <given-names>C. M.</given-names>
          </string-name>
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C. S.</given-names>
          </string-name>
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Fabre</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Dimitrov</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Esperança-Rodier</surname>
            , G. Constantin,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Becker</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Damm</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Schäfer</surname>
            ,
            <given-names>I. Rodkin</given-names>
          </string-name>
          ,
          <string-name>
            <surname>I. Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          , L.-D. S, tefan, L. Bloch,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H. P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. J. Das</surname>
            ,
            <given-names>S. A.</given-names>
          </string-name>
          <string-name>
            <surname>Hicks</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gautam</surname>
            ,
            <given-names>T. M. G.</given-names>
          </string-name>
          <string-name>
            <surname>Pakull</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>W.-W.</given-names>
          </string-name>
          <string-name>
            <surname>Yim</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Xie</surname>
          </string-name>
          , Overview of imageclef 2025:
          <article-title>Multimedia retrieval in medical, social media and content recommendation applications, in: Experimental IR Meets Multilinguality</article-title>
          , Multimodality, and
          <string-name>
            <surname>Interaction</surname>
          </string-name>
          ,
          <source>Proceedings of the 16th International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Springer Lecture Notes in Computer Science LNCS, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.-M. Drăgulinescu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Rückert</surname>
            ,
            <given-names>A. Ben</given-names>
          </string-name>
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
            , L. Bloch,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Brüngel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Idrissi-Yaghir</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <article-title>Overview of the ImageCLEF 2024: Multimedia Retrieval in Medical Applications, in: Experimental IR Meets Multilinguality</article-title>
          , Multimodality, and Interaction, Springer, Cham, Switzerland,
          <year>2024</year>
          , pp.
          <fpage>140</fpage>
          -
          <lpage>164</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>031</fpage>
          -71908-
          <issue>0</issue>
          _
          <fpage>7</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Micikevicius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Alben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Diamos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Elsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ginsburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Houston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kuchaiev</surname>
          </string-name>
          , G. Venkatesh,
          <article-title>Mixed precision training</article-title>
          ,
          <source>in: International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <source>in: International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=nZeVKeeFYf9.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <article-title>A dataset and exploration of models for understanding radiology reports</article-title>
          ,
          <source>in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <source>Association for Computational Linguistics</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . URL: https://aclanthology.org/D18-1001/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>PathVQA: 30000+ questions for medical visual question answering</article-title>
          ,
          <source>arXiv preprint arXiv:2003.10286</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Storås</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Midoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <article-title>Kvasir-VQA: A Text-Image Pair GI Tract Dataset</article-title>
          ,
          <source>in: ACM Conferences</source>
          ,
          <publisher-name>Association for Computing Machinery</publisher-name>
          , New York, NY, USA,
          <year>2024</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          . doi:10.1145/3689096.3689458.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pouget-Abadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Warde-Farley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ozair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Generative adversarial nets</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS) 27</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>2672</fpage>
          -
          <lpage>2680</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <source>in: CVPR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>10684</fpage>
          -
          <lpage>10695</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Less could be better: Parameter-efficient fine-tuning advances medical vision foundation models</article-title>
          ,
          <source>arXiv preprint arXiv:2401.12215</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: Proceedings of the 38th International Conference on Machine Learning (ICML)</source>
          , PMLR,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          . URL: https://proceedings.mlr.press/v139/radford21a.html.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Alayrac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Barr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Reeves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Millican</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Menick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Beyret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Le Paine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <article-title>Flamingo: a visual language model for few-shot learning</article-title>
          ,
          <source>arXiv preprint arXiv:2204.14198</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>