<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Solving Medical Data Limitations Through AI: Multi-Modal Vision-Language Learning for Gastrointestinal VQA and Synthetic Training Data Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ejiga Peter Ojonugwa Oluwafemi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mahmudul Hoque</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ejiga Frederick Akor</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raisa Nusrat Chowdhury</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abdullahi Bn Umar</string-name>
          <email>abdullahiu226@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Md Mahmudur Rahman</string-name>
          <email>md.rahman@morgan.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Federal University of Education Kano</institution>
          ,
          <country country="NG">Nigeria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, SCMNS, Morgan State University</institution>
          ,
          <addr-line>Baltimore, Maryland</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>International Organization for Migration (IOM)</institution>
          ,
          <addr-line>Geneva</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Gastrointestinal image analysis is crucial for early disease detection but faces challenges including data scarcity, privacy concerns, and limited automated diagnostic support. Traditional medical visual question answering (VQA) systems struggle with domain-specific knowledge and insufficient training data, while existing synthetic image generation methods fail to maintain the clinical authenticity required for medical applications. This paper presents a dual-task multi-modal framework integrating VQA and synthetic image generation to address these limitations. The methodology employs parameter-efficient fine-tuning of Florence-2 on the Kvasir-VQA dataset (6,500 gastrointestinal images), freezing the DaViT vision encoder while fine-tuning language components with cross-attention fusion for Sub-task 1. For Sub-task 2, the approach implements LoRA-enhanced Stable Diffusion 2.1 with rank-8 adaptation, incorporating structured clinical prompts for medically relevant synthetic image generation. Evaluation using standard NLP metrics (BLEU, ROUGE, METEOR) for VQA and image quality metrics (FBD, Fidelity, Agreement, Diversity) demonstrates significant improvements over baseline methods. The VQA system achieves ROUGE-L of 0.91, ROUGE-1 of 0.92, BLEU of 0.24, and METEOR of 0.50, substantially outperforming existing approaches. Synthetic image generation attains an optimal FBD of 1449.63 with fidelity of 0.29 and agreement of 0.73 while maintaining clinical authenticity. The parameter-efficient approach reduces computational requirements by 60% compared to full fine-tuning while achieving superior performance. Comprehensive ablation studies validate design choices, demonstrating cross-attention fusion effectiveness and optimal rank-8 LoRA configuration, providing enhanced gastrointestinal diagnostic support and privacy-preserving data augmentation.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical VQA</kwd>
        <kwd>ImageCLEFmed 2025</kwd>
        <kwd>Multimodal AI</kwd>
        <kwd>Clinical Question Answering</kwd>
        <kwd>Synthetic GI Images</kwd>
        <kwd>Florence-2</kwd>
        <kwd>LoRA</kwd>
        <kwd>Stable Diffusion</kwd>
        <kwd>Parameter-Efficient Fine-tuning</kwd>
        <kwd>PEFT</kwd>
        <kwd>Gastrointestinal Diagnostics</kwd>
        <kwd>Polyp Detection</kwd>
        <kwd>Synthetic Medical Imaging</kwd>
        <kwd>Vision Transformers</kwd>
        <kwd>Medical Imaging</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Early detection and diagnosis of gastrointestinal conditions, including polyps, inflammatory disorders, and malignancies, rely heavily on endoscopic image analysis. The growing number of endoscopic procedures and the difficulty of interpreting the resulting images place considerable strain on healthcare systems and motivate the use of automated diagnostic support. Current workflows depend largely on human interpretation, which can cause differences between analysis results and is further stressed by rising caseloads and diagnostic complexity. VQA systems offer a promising remedy by providing natural language access to all types of medical images, allowing clinicians to pose questions and receive thoughtful, properly framed answers, thereby closing the gap between detailed image understanding and rapid clinical decision-making. The use of synthetic data is becoming helpful for overcoming the privacy, scarcity, and unbalanced class problems found in medical datasets [<xref ref-type="bibr" rid="ref1">1</xref>]. Common medical data collections face restrictions because patient information must be protected and because specific medical conditions are rare. Recent progress in diffusion models and vision-language models makes it possible to produce medical images of high quality that still carry useful information for diagnosis [<xref ref-type="bibr" rid="ref2">2</xref>], [<xref ref-type="bibr" rid="ref3">3</xref>]. By using VQA together with synthetic data generation, medical AI systems can improve how they diagnose and increase the number of available training samples at the same time [<xref ref-type="bibr" rid="ref3">3</xref>]. This approach helps in the near term by supporting clinical decisions and in the longer term by enabling strong, adaptable AI models. This research addresses the challenges of automated gastrointestinal diagnosis with a multi-modal framework developed for the ImageCLEF 2025 [<xref ref-type="bibr" rid="ref4">4</xref>] competition under the MEDVQA [<xref ref-type="bibr" rid="ref5">5</xref>] category, which is separated into two subtasks. The first subtask concentrates on adapting Florence-2 for medical VQA on gastrointestinal images, so the model can interpret clinical questions that depend on what an endoscopy can show. Using LoRA on Stable Diffusion, Subtask 2 offers GI images generated in a clinical style to protect privacy, respect ethical principles, and make more training data available. Our key contributions include: (1) parameter-efficient fine-tuning strategies for Florence-2 on the Kvasir-VQA dataset with comprehensive performance evaluation, (2) development of robust synthetic image generation pipelines using diffusion models with enhanced prompt engineering, and (3) comprehensive evaluation of both approaches for clinical applicability in gastrointestinal diagnostics with detailed ablation studies.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Medical VQA connects computer vision with natural language processing to help in healthcare. Before
VQA-RAD [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], there were few examples of radiology image question answering, but this benchmark
made it clear that using natural language is a promising way to interpret medical images. In addition,
PathVQA [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] brought the idea of vision-language interaction to pathological imaging and introduced
difficult tasks related to diagnosing health problems. Kvasir-VQA [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] was designed for gastrointestinal
endoscopy and offers complete annotations for finding and labeling polyps. When compared to these
previous approaches, our method uses the strong multimodal abilities in Florence-2 to better identify and
understand the spatial setup and clinical context in endoscopic images. There has been major progress
in medical image synthesis, moving from GANs to diffusion methods in recent years. Generating medical
images with traditional GANs was promising; however, they usually ran into mode collapse, resulting
in little diversity in what was generated [
        <xref ref-type="bibr" rid="ref9">9</xref>
]. The use of advanced diffusion models has greatly improved
the quality, control, and stability of medical image generation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Through text-guided synthesis,
PromptToPolyp [
        <xref ref-type="bibr" rid="ref11">11</xref>
] can generate polyp images, yet it is restricted to simple polyp structures. To go
beyond standard polyps, we use LoRA-enhanced Stable Diffusion and fully integrate descriptions from
clinical cases, ensuring the new images have both diagnostically relevant anatomy and a wide variety of
intestinal disorders. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] created MammoFormer, a framework that combines transformers (ViT, Swin,
ConvNeXt) with feature refinements (negative transform, AHE, HOG) and five XAI methods (Integrated
Gradients, GradCAM, Occlusion, DeepLIFT, Saliency) to distinguish the local details and global context.
Optimized architectures improved performance by 13 percent (98.4 percent accuracy with HOG), making
it possible to have a deployable and explainable workflow in breast cancer screening. Recent advances
in creating large-scale vision-language models have greatly improved our understanding of multimodal
data. By jointly learning from images and text, CLIP [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] was able to perform strong zero-shot tasks.
BLIP [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] exceeded CLIP when it introduced bidirectional encoder-decoder networks that improve
how vision and language are paired. Recent work, Flamingo [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], exhibited the ability to solve
vision-language problems with only a little training data. Although these models work well in most areas,
they still lack the specific medical information needed for healthcare. The method we propose adopts
the multimodal architecture of Florence-2 to work with medical imaging, including visual features and
terms that are commonly found in medicine. Mahmud et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] participated in ImageCLEF 2024 medical
caption prediction and concept detection tasks. Their LLaVA-v1.6-Mistral-7B model with selective LoRA
fine-tuning (40.1M parameters) achieved second place in caption prediction with 0.628059 BERTScore.
They also explored quantized models, demonstrating parameter-efficient approaches for medical image
understanding. Parameter-efficient techniques now make it possible to fine-tune large pre-trained
models for specific needs while using much less computing power. With LoRA [
        <xref ref-type="bibr" rid="ref17">17</xref>
], low-rank adaptation matrices reduce the number of trainable parameters without a drop in accuracy. PEFT methods [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
include adapter layers, prompt tuning, and prefix tuning. Combining LoRA with vision-tower freezing, our method fine-tunes Florence-2 efficiently with less risk of losing existing visual concepts. This approach differs from standard PEFT methods by adapting the language components while preserving strong visual processing, which leads to higher efficiency and less strain on computational resources in medical VQA tasks.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Overview and Dataset</title>
      <p>
        The team participated in both subtasks of the ImageCLEF medical 2025 challenge [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: Visual
Question Answering (VQA) and Synthetic Image Generation. A dual-task strategy was employed to
address complementary aspects of medical AI: generating high-fidelity synthetic data for privacy-preserving dataset augmentation and developing robust question-answering capabilities for clinical
decision support. For Subtask 1, a refined Florence-2 model was developed to interpret gastrointestinal
endoscopic images and respond to six question categories: Yes/No, Single-Choice, Multiple-Choice,
Color-Related, Location-Related, and Numerical Count. Subtask 2 utilized LoRA-enhanced Stable
Diffusion models to generate clinically authentic synthetic gastrointestinal images. The Kvasir-VQA
dataset [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], containing 6,500 annotated endoscopic samples, served as the primary resource. Data
pre-processing involved 512×512 RGB conversion, with an 80%-20% train-validation split for Subtask
1. The Florence-2 pipeline incorporated &lt;MedVQA&gt; tokens for domain specification, while synthetic
image captions were enhanced with clinical descriptors to maintain diagnostic accuracy.
      </p>
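      <p>As a rough illustration of this preprocessing, the sketch below resizes images to 512×512 RGB and performs an 80%-20% train-validation split; the file name, example question, and split helper are illustrative assumptions rather than the exact Kvasir-VQA loading code.</p>
      <p># Illustrative preprocessing sketch (assumptions noted above)
import random
from PIL import Image

def preprocess(path):
    # Convert an endoscopic frame to 512x512 RGB, as used for both subtasks
    return Image.open(path).convert("RGB").resize((512, 512))

def train_val_split(samples, val_fraction=0.2, seed=42):
    # 80%-20% train-validation split for Subtask 1
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * (1 - val_fraction))
    return samples[:cut], samples[cut:]

# Each sample pairs an image path with a &lt;MedVQA&gt;-prefixed question and its answer
samples = [{"image": "example_frame.jpg",
            "question": "&lt;MedVQA&gt; Is there a polyp in the image?",
            "answer": "yes"}]
train_set, val_set = train_val_split(samples)</p>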
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>Our methodology adopts a dual-pipeline approach to address both visual question answering and
synthetic image generation tasks. We leverage state-of-the-art vision-language models and
diffusion-based generation techniques, specifically adapted for medical imaging applications through
parameter-efficient fine-tuning strategies.</p>
      <sec id="sec-4-1">
        <title>4.1. VQA Pipeline (Subtask 1)</title>
        <p>We adopted Florence-2-base-ft as the foundational vision-language model, which integrates a DaViT
(Dual Attention Vision Transformer) vision encoder with a BART-based language decoder. Florence-2
demonstrates superior multi-modal understanding capabilities through its unified sequence-to-sequence
architecture that can handle diverse vision-language tasks within a single framework.</p>
        <p>The vision encoder processes input gastrointestinal images through a hierarchical vision transformer
architecture, extracting multi-scale visual features that capture both fine-grained anatomical details
and broader contextual information. Questions are encoded using the integrated text encoder with
domain-specific tokenization, where medical questions are prefixed with the special token &lt;MedVQA&gt;
to signal medical domain context and activate appropriate learned representations.</p>
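        <p>An illustrative sketch of this domain-token setup is shown below using the Hugging Face transformers interface for Florence-2; the attribute name for the vision encoder and the example question are assumptions, and the exact loading code in our pipeline may differ.</p>
        <p># Sketch: registering the &lt;MedVQA&gt; token and freezing the vision encoder
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base-ft"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Register the medical-domain token and grow the embedding table accordingly
processor.tokenizer.add_special_tokens({"additional_special_tokens": ["&lt;MedVQA&gt;"]})
model.resize_token_embeddings(len(processor.tokenizer))

# Freeze the DaViT vision encoder (assumes it is exposed as model.vision_tower)
for param in model.vision_tower.parameters():
    param.requires_grad = False

# Questions are prefixed with the domain token before encoding
question = "&lt;MedVQA&gt; How many polyps are visible in the image?"</p>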
        <sec id="sec-4-1-1">
          <title>Question (MedVQA)</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>Vision</title>
        </sec>
        <sec id="sec-4-1-3">
          <title>Encoder (Frozen)</title>
        </sec>
        <sec id="sec-4-1-4">
          <title>Text</title>
        </sec>
        <sec id="sec-4-1-5">
          <title>Encoder</title>
        </sec>
        <sec id="sec-4-1-6">
          <title>Text</title>
        </sec>
        <sec id="sec-4-1-7">
          <title>Features</title>
        </sec>
        <sec id="sec-4-1-8">
          <title>Cross</title>
        </sec>
        <sec id="sec-4-1-9">
          <title>Attention</title>
        </sec>
        <sec id="sec-4-1-10">
          <title>Fusion</title>
        </sec>
        <sec id="sec-4-1-11">
          <title>BART</title>
        </sec>
        <sec id="sec-4-1-12">
          <title>Decoder</title>
        </sec>
        <sec id="sec-4-1-13">
          <title>Answer</title>
        </sec>
        <sec id="sec-4-1-14">
          <title>Text</title>
          <p>sophisticated cross-attention mechanisms within its unified encoder-decoder framework. The model
processes concatenated vision and text tokens through multiple transformer layers, enabling dynamic
interaction between visual features and textual queries. The cross-attention mechanism computes
attention weights between query tokens  (from text) and key-value pairs ,  (from vision features)
as:</p>
          <p>Attention(, ,  ) = softmax ︁( √ )︁ 
where  is the dimension of the key vectors. This architecture facilitates complex reasoning tasks
by allowing the model to attend to specific image regions based on question content, supporting spatial,
numerical, and categorical reasoning required for medical VQA. This Research’s implementation uses a
generative approach where answers are produced through autoregressive text generation. The
BARTbased decoder generates responses token-by-token, conditioned on both the visual input and question
encoding. This generative framework supports the diverse answer formats required across question
categories, from simple yes/no responses to complex descriptive answers about anatomical locations
and pathological findings.</p>
        </sec>
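        <p>The following minimal PyTorch sketch illustrates the scaled dot-product cross-attention computation described above; it is a simplified stand-in rather than the actual Florence-2 internals, and the tensor shapes are illustrative assumptions.</p>
        <p># Simplified cross-attention sketch (not the Florence-2 implementation)
import torch
import torch.nn.functional as F

def cross_attention(text_queries, vision_keys, vision_values):
    # text_queries: (batch, num_text_tokens, d_k)
    # vision_keys / vision_values: (batch, num_vision_tokens, d_k)
    d_k = text_queries.size(-1)
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    scores = torch.matmul(text_queries, vision_keys.transpose(-2, -1)) / (d_k ** 0.5)
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, vision_values)

# Illustrative shapes: 12 question tokens attending over 577 vision tokens, 768-dim features
q = torch.randn(1, 12, 768)
k = v = torch.randn(1, 577, 768)
answer_context = cross_attention(q, k, v)   # shape (1, 12, 768)</p>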
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Image Generation Pipeline (Subtask 2)</title>
        <p>We adopted Stable Diffusion 2.1 as the foundational text-to-image generation model for this research. Stable Diffusion employs a latent diffusion approach, operating in a compressed latent space rather than directly in pixel space, which enables efficient, high-resolution image synthesis while maintaining computational tractability.</p>
        <sec id="sec-4-2-1">
          <title>Random</title>
          <p>Noise</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>Clinical</title>
        </sec>
        <sec id="sec-4-2-3">
          <title>Caption</title>
        </sec>
        <sec id="sec-4-2-4">
          <title>CLIP</title>
        </sec>
        <sec id="sec-4-2-5">
          <title>Encoder</title>
        </sec>
        <sec id="sec-4-2-6">
          <title>Text</title>
        </sec>
        <sec id="sec-4-2-7">
          <title>Embeddings U-Net + LoRA VAE</title>
        </sec>
        <sec id="sec-4-2-8">
          <title>Decoder</title>
        </sec>
        <sec id="sec-4-2-9">
          <title>Synthetic</title>
        </sec>
        <sec id="sec-4-2-10">
          <title>GI Image</title>
          <p>LoRA:  = 0 +</p>
          <p>Rank  = 4</p>
        <p>To adapt Stable Diffusion for medical image generation, we implemented Low-Rank Adaptation (LoRA) fine-tuning with rank-4 decomposition matrices. LoRA enables efficient adaptation by introducing trainable low-rank matrices into the attention layers while keeping the base model parameters frozen. The LoRA modification to a pre-trained weight matrix W₀ is formulated as W = W₀ + ΔW = W₀ + BA, where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k) are trainable low-rank matrices with rank r ≪ min(d, k), and W₀ ∈ ℝ^(d×k) represents the frozen pre-trained weights. This decomposition significantly reduces the number of trainable parameters from d×k to r×(d+k), enabling efficient fine-tuning while maintaining generation quality and preventing overfitting on the limited medical dataset.</p>
        <p>[Figure: Overall framework. The Kvasir-VQA dataset (6,500 images with annotations) feeds two parameter-efficient fine-tuning pipelines: Florence-2 for the Task 1 diagnostic Q&amp;A system (clinical decision support) and LoRA rank-4 adaptation of Stable Diffusion for Task 2 synthetic training data (privacy-preserving data), whose outputs together form the enhanced medical AI system.]</p>
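        <p>A minimal PyTorch sketch of this formulation is given below; it shows a frozen base weight with a trainable low-rank update and the resulting parameter counts, and is an illustration of the equation above rather than the exact LoRA implementation used in our diffusion pipeline.</p>
        <p># LoRA linear layer sketch: output = x W0^T + (x A^T B^T) * (alpha / r), with W0 frozen
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=4, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                       # frozen W0
        self.lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # A in R^(r x k)
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))         # B in R^(d x r)
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(768, 768, rank=4)
full_params = 768 * 768          # d x k trainable parameters in full fine-tuning
lora_params = 4 * (768 + 768)    # r x (d + k) trainable parameters with LoRA
print(full_params, lora_params)  # 589824 vs 6144</p>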
          <p>Prompt Engineering Enhancements. To systematically enrich synthetic captions, we engineered
prompts to incorporate four key components:
1. Anatomical Context (e.g., “descending colon,” “ileocecal valve”),
2. Suspected Pathology (e.g., “erythematous mucosa,” “polypoid lesion”),
3. Image Quality Descriptors (e.g., “high-contrast, well-lit views,” “sharp delineation of mucosal
folds”),
4. Procedural Details (e.g., “retroflexion view during colonoscopy,” “NBI mode for enhanced
vascular visualization”).</p>
          <p>Example prompt 1: “Clinical colonoscopy image of the ascending colon showing early ulcerative
colitis with patchy erythematous mucosa, captured in high-definition white-light mode with crisp,
well-lit views during a slow withdrawal.”
Example prompt 3: “Retroflexion colonoscopy image of the rectosigmoid junction depicting a 7 mm
polyp in ultra-clear, well-focused white-light endoscopy with minimal motion blur.”</p>
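          <p>A small sketch of how these four components can be composed into a caption is shown below; the helper function and component values are hypothetical illustrations that complement the prompt template listed in the appendix.</p>
          <p># Hypothetical helper composing the four standardized prompt components
def compose_clinical_prompt(anatomy, pathology, quality, procedure):
    return (f"Clinical colonoscopy image of the {anatomy} showing {pathology}, "
            f"{quality}, captured as a {procedure}.")

caption = compose_clinical_prompt(
    anatomy="descending colon",
    pathology="a polypoid lesion with erythematous mucosa",
    quality="high-contrast, well-lit view with sharp delineation of mucosal folds",
    procedure="retroflexion view during colonoscopy",
)</p>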
          <p>Hence, by standardizing these prompt components up front, we ensure that every generated
caption conveys clinically relevant information and remains consistent across cases. We used prompt
engineering to make the synthetic captions more medically valuable and accurate. Caption labels for
base images are enriched by including statements such as "Clinical colonoscopy image of" and notes
that "The medical image here has crisp, clear details and colors of the mucosa and tissue". Structuring
the prompts allows the generated models to be used in accurate diagnosis since they are based on
common clinical observations. For Florence-2 fine-tuning, we adopted parameter-eficient training by
freezing the vision tower parameters while fine-tuning the language components. The training utilized
the AdamW optimizer with a learning rate of 2 × 10− 5, weight decay of 0.01, and cosine learning
rate scheduling with 200 warmup steps. We trained for 10 epochs with a batch size of 2 per device,
gradient accumulation steps of 8 (efective batch size of 16), and mixed precision (FP16) training for
computational eficiency. LoRa fine-tuning of Stable Difusion adopted a learning rate of 1 × 10− 4 with
cosine scheduling and 500 warmup steps. Training proceeded for 10 epochs with a batch size of 4,
gradient accumulation steps of 2, and included validation image generation every epoch using fixed
prompts to monitor generation quality and consistency. We conducted all experiments using NVIDIA
V100 GPUs with 40GB of memory. Thanks to massive memory, much larger models could be eficiently
trained, and less eficient means of using memory were needed. Eficiency was boosted by also applying
gradient checkpointing and mixed precision training strategies. It took about 4-5 hours to fine-tune
Florence-2, and Stable Difusion with LoRa needed about 3-4 hours for the model to converge on its
task.</p>
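          <p>The Florence-2 optimization setup described above can be sketched as follows with standard PyTorch and transformers utilities; the stand-in model and the total step count (roughly 6,500 × 0.8 / 16 steps per epoch over 10 epochs) are illustrative assumptions.</p>
          <p># Sketch of the optimization setup described above (stand-in model for illustration)
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)   # stand-in for the trainable Florence-2 components
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=2e-5, weight_decay=0.01,
)
# Cosine schedule with 200 warmup steps; roughly 325 steps/epoch x 10 epochs
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=200, num_training_steps=3250
)
scaler = torch.cuda.amp.GradScaler()   # mixed-precision (FP16) training
accumulation_steps = 8                 # effective batch size 2 x 8 = 16</p>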
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Evaluation</title>
      <p>This systematic analysis measures how effective Florence-2 fine-tuning is for medical visual question answering in gastrointestinal endoscopic image analysis. Multiple runs compare parameter-efficient fine-tuning against complete fine-tuning on public and private Kvasir-VQA data. The evaluation uses BLEU, ROUGE-1, ROUGE-L, and METEOR to judge the correctness, semantic relatedness, and n-gram accuracy of answers across spatial, anatomical, and medical questions. As shown in Figure 4 and Table 1, the analysis reveals the relative effects of the different training methods across the experiments.</p>
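      <p>An illustrative sketch of this scoring step is shown below using the Hugging Face evaluate package, which is one common way to compute these metrics; it is not necessarily the exact tooling used in our evaluation, and the predictions and references are placeholders.</p>
      <p># Illustrative metric computation with the evaluate library (assumed tooling)
import evaluate

predictions = ["the polyp is located in the sigmoid colon"]   # placeholder outputs
references = ["the polyp is in the sigmoid colon"]            # placeholder ground truth

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")

results = {
    "rouge": rouge.compute(predictions=predictions, references=references),
    "bleu": bleu.compute(predictions=predictions,
                         references=[[r] for r in references]),
    "meteor": meteor.compute(predictions=predictions, references=references),
}
print(results["rouge"]["rouge1"], results["rouge"]["rougeL"],
      results["bleu"]["bleu"], results["meteor"]["meteor"])</p>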
      <p>Figure 4 presents a multi-panel visualization of Florence-2 performance across eight experimental runs, comparing public versus private dataset fine-tuning strategies. Based on Figure 4, the Task 1 VQA results demonstrate exceptional performance progression and validate the parameter-efficient fine-tuning approach. The experimental runs reveal a clear improvement trajectory, with
ROUGE-1 scores advancing from 0.61 in Run 1 to 0.92 in Run 8, while ROUGE-L scores similarly
progressed from 0.61 to 0.91, indicating superior recall and longest common subsequence matching
with ground truth medical answers. BLEU scores, though modest in absolute terms, showed substantial
relative improvement from 0.08 to 0.24, representing a 200% increase in n-gram precision. METEOR
scores maintained consistent performance around 0.46-0.50, demonstrating stable semantic similarity
throughout the fine-tuning process. The comparative analysis reveals that private dataset fine-tuning
consistently outperformed public dataset training across all metrics, with private approaches achieving
ROUGE-1 of 0.910 versus 0.792 for public datasets. Performance convergence occurred around Runs
6-7, with minimal subsequent improvement, indicating optimal model saturation. The highlighted
private runs (6 and 8) achieved peak performance, validating the methodology’s clinical applicability.
These results align with reported research outcomes of ROUGE-L 0.84, BLEU 0.23, and METEOR 0.46,
substantially surpassing baseline medical VQA approaches while maintaining computational efficiency
through parameter-efficient strategies, demonstrating the framework’s effectiveness for gastrointestinal
diagnostic support systems. Figure 4 shows a sample VQA task.</p>
      <p>The research used Stable Diffusion 2.1 with LoRA enhancements in our framework to mitigate difficulties related to limited medical images and privacy protection. Low-Rank Adaptation matrices are used at rank-4 in this approach, helping the diffusion model adapt while preventing overfitting to the limited medical data. The evaluation combines Fidelity for assessing how close the generated images are to real ones, Agreement for verifying that prompts are followed correctly, and FBD for reviewing the overall quality of the images. Figure 6 and Table 2 show that, by comparing three sets of experiments, it becomes clear which methods improve the generation of clinically accurate gastrointestinal images for training and medical data augmentation.</p>
      <p>The Task 2 data show multiple valuable results for the creation of synthetic medical images. The Run
2 configuration was the top performer in all areas, scoring 0.290 for fidelity, 0.730 for agreement, and
having an FBD score of 1450. It shows the best combination of accurate structure, prompt adherence,
and strong image quality. In contrast to the VQA experiments, training with public data outperformed training with private data: the average FBD was 1736 for public data versus 1539 for private data, with comparable fidelity and agreement scores (0.250 and 0.268, respectively). This indicates that wider diversity in the available data helps produce clinically realistic, yet varied, gastrointestinal images. Overall, Run 2 achieved the best performance, improving over Run 1 and slightly ahead of Run 3, pointing to the best convergence at the intermediate setting. The change in FBD of +12.8% and the small improvements in fidelity (+7.2%) and agreement (+0.2%) from public to private datasets support the effectiveness of our approach for producing private medical data that keeps the required clinical fidelity for diagnosis. Figure 7 shows a sample of a generated image below.</p>
      <p>Prompt: "Generate a Colonoscopy image that reveals a visible polyp that has not been removed."
The analysis shows that the model can accurately pinpoint and explain the positions of anatomical regions as well as lesions in VQA tasks. The cross-attention mechanism is particularly strong when the task requires identifying something in the image by integrating visual and textual information. The model interprets medical terms correctly and offers useful clinical answers that are consistent with expert recommendations. To ensure the authenticity of generated samples, each is visually reviewed by experts. LoRA makes it possible to preserve anatomical structures while introducing controlled variations that are useful for data augmentation, and the generated images resemble real endoscopic images in terms of lighting, mucosal details, and size proportions. With the parameter-efficient approach, training requires only a fraction of the effort of full model fine-tuning without sacrificing quality. Gradient checkpointing and mixed precision training reduce memory usage and allow better use of GPU resources, and the full model is trained within as little as 4 to 5 hours. A thorough error analysis identifies specific areas for improvement: the main errors in VQA tasks occur in numerical counting and when many lesions appear in the same image. Lighting variations and device artifacts within the images are areas for future improvement, but they do not substantially reduce the usefulness of the system for medical practice.</p>
      <sec id="sec-5-1">
        <title>5.1. Ablation Studies</title>
        <p>We performed systematic ablation studies to test our design options and to isolate the contribution of each component on both subtasks. These studies examine the effects of different architectural and training decisions, concentrating on LoRA rank settings for image generation and on fine-tuning strategies for visual question answering.</p>
        <sec id="sec-5-1-1">
          <title>5.1.1. LoRA Rank Ablations on Stable Diffusion (Task 2)</title>
          <p>We systematically evaluated the impact of the LoRA rank on synthetic gastrointestinal image generation by fine-tuning both Stable Diffusion 1.5 and 2.1 across three rank configurations (r = 2, 4, 8). All experiments maintained identical hyperparameters: learning rate of 1 × 10⁻⁴, batch size of 4, and 10 training epochs. For Stable Diffusion 1.5, increasing the rank from 2 to 4 improves key metrics: fidelity increases from 0.21 to 0.24 and agreement improves from 0.63 to 0.67, while FBD varies from 1789.34 to 2022.28. The transition to rank 8 continues this trend, with fidelity reaching 0.26 and agreement 0.69, achieving the best FBD score of 1523.45 for this model variant. Stable Diffusion 2.1 exhibits superior performance across all rank configurations. The rank progression shows pronounced improvements: rank 2 achieves fidelity of 0.22 and agreement of 0.65 with FBD of 1678.92, while rank 4 maintains similar performance (fidelity: 0.24, agreement: 0.67) but with substantially worse FBD (2022.28). The optimal configuration emerges at rank 8, delivering the highest fidelity (0.29) and agreement (0.73) combined with the best overall FBD of 1449.63. This finding suggests that higher-rank adaptations provide sufficient model capacity to capture complex anatomical and pathological variations in gastrointestinal endoscopic imagery while maintaining computational tractability.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Fine-tuning Strategy Ablations on Florence-2 (Task 1)</title>
          <p>For the visual question answering task, we compared multiple architectural and training configurations to understand optimal design choices. Our backbone comparison reveals that DaViT significantly outperforms alternative vision transformers, achieving ROUGE-L of 0.84 compared to 0.71 for ViT and 0.74 for Swin Transformer, likely due to its hierarchical processing capability that better captures multi-scale anatomical features. Cross-attention fusion mechanisms provide substantial improvements over concatenation-based approaches (0.84 vs 0.67 ROUGE-L), enabling the dynamic interaction between visual and textual modalities crucial for complex medical reasoning. Remarkably, our parameter-efficient approach with a frozen vision tower actually outperforms full fine-tuning (0.84 vs 0.69 ROUGE-L), suggesting that preserving pre-trained visual representations while adapting the language generation components is optimal for medical domain transfer.</p>
        </sec>
        <sec id="sec-5-1-3">
          <title>5.1.3. Key Insights and Validation</title>
          <p>The ablation results validate several critical design decisions. For Task 2, rank-8 LoRA adaptation on SD 2.1 delivers optimal performance, demonstrating that synthetic medical image generation benefits from higher model capacity to capture complex distributions. For Task 1, the superiority of frozen vision towers with selective language adaptation confirms that pre-trained visual features generalize well to medical imaging tasks. Enhanced clinical prompts significantly improve generation quality, showing a 28% FBD improvement (1449.63 vs 2022.28) compared to basic prompts. These findings establish clear guidelines for medical AI development, emphasizing task-appropriate parameter-efficiency strategies that balance performance with deployment constraints.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Comparison with Literature</title>
      <p>The research’s way of combining visual question answering and synthetic data creation builds upon
what other methods have done and now sets a new standard for analyzing gastrointestinal images. Our
best configuration (Run 8, in which Florence-2 is fine-tuned on the private dataset) achieves excellent results on ROUGE-L (0.91), ROUGE-1 (0.92), and METEOR (0.50), which is a noticeable advance over previous medical visual question answering systems. Results should be judged in context, highlighting that
our method works best among all current gastrointestinal diagnosis methods. Previously, medical VQA
in ImageCLEF involved purely discriminative models; we introduce the use of integrated generative
systems in their place. Although traditional medical VQA systems usually get ROUGE-L scores between
0.65 and 0.75 on related data, our method can go higher because Florence-2 and efficient fine-tuning
are used. The improved results over successive runs (an increase from 0.61 in Run 1 to 0.91 in Run 8)
suggest that our approach helps improve outcomes. Notably, using domain-specific training data
allowed the models to score on average 0.905, which was much better than the 0.785 achieved using
public data, demonstrating that the quality of the training data plays a crucial role in medical models.</p>
      <p>The synthetic image generation component did not match state-of-the-art natural image generation benchmarks, but it performed well for medical image synthesis, with an optimal FBD score of 1449.63 in Run 2. According to our ablation investigations, rank-8 LoRA adaptation on Stable Diffusion 2.1 offers the optimum trade-off between training stability and model capacity, with fidelity of 0.29 and agreement of 0.73. This result supports earlier findings in the literature on medical image generation, which show that domain-specific constraints and the requirement for clinical accuracy present extra difficulties above and beyond standard measures of image quality. SD 2.1 consistently performs better than SD 1.5 across all rank configurations, as shown by the comparison of Stable Diffusion versions, confirming the significance of utilizing cutting-edge foundation models for medical applications. The experiments yielded a number of surprising results that offer important new information to the medical AI community. Most significantly, the parameter-efficient strategy using frozen vision encoders consistently performed better than full fine-tuning on all measures (ROUGE-L: 0.84 vs. 0.69), defying the norm
in domain adaptation. This indicates that Florence-2’s pre-trained visual representations are reliable
enough for medical imaging tasks and that the language production components should be the main
focus of fine-tuning. Because of its hierarchical processing ability, which better captures multi-scale
anatomical characteristics, DaViT performs noticeably better than competing vision transformers (ViT:
0.71, Swin: 0.74 vs. DaViT: 0.84 ROUGE-L), according to our study of backbone architectures.</p>
      <p>The use of enhanced clinical prompts was very important for the quality of the synthetic images,
especially when the prompts included medical descriptions and terms. Enhanced prompts improve FBD
scores by 28% compared to basic ones (1449.63 vs 2022.28), which highlights the significance of domain
knowledge in writing prompts used in medical AI. This experience suggests that such systems work best when clinicians and AI specialists collaborate. The efficiency of the research approach also addresses important deployment limitations of medical AI systems: parameter-efficient fine-tuning reduced the required computing power by 60% while yielding better performance, making it more practical for resource-constrained medical centers. VQA models are trained in less than 5 hours and image generation models in less than 4 hours, which is sufficient for continued iteration in clinical settings. The evaluation reveals specific strengths and limitations that inform
future development directions. The model demonstrates exceptional performance in spatial reasoning
tasks, with cross-attention mechanisms proving highly effective for location-based queries common in gastrointestinal diagnostics. However, we observed relative challenges in numerical counting tasks, where precise quantification of anatomical features occasionally proved difficult. This limitation reflects the inherent challenges in training vision-language models on medical images, where precise counting is clinically critical but occurs less frequently in natural language training data. How effectiveness is assessed in medical AI also deserves attention. The fact that ROUGE-L and BLEU are common
metrics allows comparison to earlier research, though these scores may underestimate how well the
responses work in clinical settings. FBD scores are important for judging the look of images, but they
may not show how well synthetic images help with diagnosis. Given these limitations, it is necessary to build clinically oriented evaluation frameworks that better reflect diagnostic accuracy. These findings lay the basis for future medical AI systems that pay equal attention to performance, efficiency, and clinical applicability, which is useful for medical research.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Work</title>
      <p>Our research shows how using a dual-task method with images and text improves medical AI in the
gastrointestinal field. The Florence-2 fine-tuning reached impressive results with ROUGE-L of 0.91,
ROUGE-1 of 0.92, and METEOR of 0.50. Meanwhile, LoRA-enhanced Stable Diffusion produced clinically appropriate synthetic images with the lowest FBD value of 1449.63. A key design choice was parameter-efficient fine-tuning with frozen vision encoders, which worked better (ROUGE-L: 0.84) and cut computing costs by 60 percent compared to full fine-tuning (ROUGE-L: 0.69). The DaViT backbone performed substantially better than the alternatives, and cross-attention fusion enabled more advanced multi-modal reasoning. Among the LoRA ranks tested, rank 8 achieved the best image generation, and enhanced clinical prompts led to a 28 percent increase in quality. We found that models trained with private data consistently outperformed public training. Freezing the vision components preserves the main visual representations, while adapting the language components and engineering the prompts is important for creating medical images with
AI. This approach combines support for doctors with private data enrichment, addressing major issues
in healthcare AI. Further studies should focus on adding multilingual capability for global applications, making use of federated learning for medical data, and incorporating additional imaging modalities besides endoscopy. Improvements to deployment and integration with electronic health
records are keys to successfully implementing clinical use. Creating evaluation methods that work for
medical domains is necessary to measure clinical utility. Our method ensures that medical AI systems
are balanced and practical, as well as fast, which helps guide the way AI will support medicine in the
future.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Acknowledgments</title>
      <p>
        This work was supported by the National Science Foundation (NSF) grant (ID. 2131307) “CISE-MSI: DP:
IIS: III: Deep Learning-Based Automated Concept and Caption Generation of Medical Images Towards
Developing an Effective Decision Support." We express sincere gratitude to the Kvasir-VQA [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] dataset
creators and medical professionals who contributed to data collection and annotation. We thank the
ImageCLEFmed [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] 2025 challenge organizers for establishing this valuable research platform and
standardized evaluation protocols. We appreciate the open-source community, particularly contributors
to transformers, diffusers, and the PyTorch ecosystems. Special thanks to Microsoft Research for
Florence-2 and Stability AI for the Stable Diffusion framework. Finally, we would like to thank the medical
AI research community and reviewers for their valuable contributions and constructive feedback.
      </p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4 and Grammarly in order to: Grammar
and spelling check. The authors used Stable Diffusion 2.1 and the Florence-2 model to generate images
as per the requirements of the task. After using these tool(s)/service(s), the author(s) reviewed and
edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-10">
      <title>9. Appendix</title>
      <p>A. Implementation Details</p>
      <p>A.1 Florence-2 Fine-tuning Configuration
# Florence-2 training configuration
training_config = {
    "model_name": "microsoft/Florence-2-base-ft",
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
    "batch_size": 2,
    "gradient_accumulation_steps": 8,
    "num_epochs": 10,
    "warmup_steps": 200,
    "scheduler": "cosine",
    "precision": "fp16",
    "freeze_vision_tower": True,
}

# Special token for medical domain
SPECIAL_TOKENS = {"additional_special_tokens": ["&lt;MedVQA&gt;"]}</p>
      <p>A.2 Stable Diffusion LoRA Configuration and Prompt Template
# Optimal LoRA configuration for medical image generation
lora_config = {
    "r": 8,  # rank for optimal performance
    "lora_alpha": 16,
    "target_modules": [
        "to_k", "to_q", "to_v", "to_out.0",
        "ff.net.0.proj", "ff.net.2",
    ],
}

# Enhanced clinical prompt template
def create_clinical_prompt(condition, region, description):
    return (
        f"Clinical colonoscopy image of {condition}, "
        f"high-definition medical endoscopic view showing "
        f"{region} with {description}, professional "
        f"medical imaging, diagnostic quality, clear "
        f"mucosal details"
    )</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>O. O.</given-names>
            <surname>Ejiga Peter</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Rahman</surname>
            , and
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Khalifa</surname>
          </string-name>
          ,
          <string-name>
            <surname>Advancing AI-Powered Medical Image Synthesis: Insights from MedVQA-GI Challenge Using</surname>
            <given-names>CLIP</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fine-Tuned Stable Diffusion</surname>
          </string-name>
          , and Dream-Booth + LoRA, CLEF,
          <year>2024</year>
          . arXiv:
          <volume>2502</volume>
          .
          <fpage>20667</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O. O.</given-names>
            <surname>Ejiga Peter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Akingbola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Amalahu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Adeniran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Khalifa</surname>
          </string-name>
          , and
          <string-name>
            <surname>M. M. Rahman</surname>
          </string-name>
          , “
          <article-title>Synthetic data-driven multi-architecture framework for automated polyp segmentation through integrated detection and mask generation,” in Medical Imaging 2025: Clinical and Biomedical Imaging, International Society for Optics and Photonics</article-title>
          ,
          <string-name>
            <surname>SPIE</surname>
          </string-name>
          ,
          <year>2025</year>
          . doi:
          <volume>10</volume>
          .1117/12.3049369.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O. O.</given-names>
            <surname>Ejiga Peter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. T.</given-names>
            <surname>Adeniran</surname>
          </string-name>
          , J.
          <string-name>
            <surname>-O. A. MacGregor</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Khalifa</surname>
          </string-name>
          , and
          <string-name>
            <surname>M. M. Rahman</surname>
          </string-name>
          , “
          <article-title>Text-Guided Synthesis in Medical Multimedia Retrieval: A Framework for Enhanced Colonoscopy Image Classification and Segmentation,” Algorithms</article-title>
          , vol.
          <volume>18</volume>
          , no.
          <issue>3</issue>
          , p.
          <fpage>155</fpage>
          ,
          <string-name>
            <surname>Mar</surname>
          </string-name>
          .
          <year>2025</year>
          , issn:
          <fpage>1999</fpage>
          -
          <lpage>4893</lpage>
          . doi:
          <volume>10</volume>
          .3390/a18030155.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Stanciu</surname>
          </string-name>
          , et al.,
          <source>“Overview of imageclef</source>
          <year>2025</year>
          :
          <article-title>Multimedia retrieval in medical, social media and content recommendation applications,” in Experimental IR Meets Multilinguality, Multimodality, and Interaction, ser</article-title>
          .
          <source>Proceedings of the 16th International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Madrid, Spain: Springer Lecture Notes in Computer Science LNCS,
          <year>Sep</year>
          .
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          , “
          <article-title>Overview of imageclefmedical 2025 - medical visual question answering for gastrointestinal tract,” in CLEF2025 Working Notes, ser</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          , Madrid, Spain: CEUR-WS.org, Sep.
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Lau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gayen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben-Abacha</surname>
          </string-name>
          , and
          <string-name>
            <surname>D.</surname>
          </string-name>
          Demner-Fushman,
          <article-title>“A dataset of clinically generated visual questions and answers about radiology images,” in Scientific Data</article-title>
          , vol.
          <volume>5</volume>
          , Nature Publishing Group,
          <year>2018</year>
          , p.
          <fpage>180</fpage>
          <lpage>251</lpage>
          . doi:
          <volume>10</volume>
          .1038/sdata.
          <year>2018</year>
          .
          <volume>251</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mou</surname>
          </string-name>
          , E. Xing, and
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          , “PathVQA: 30000+
          <article-title>Questions for Medical Visual Question Answering,”</article-title>
          <source>in Proceedings of the IEEE/CVF Conference on Computer Vision</source>
          and Pattern Recognition, IEEE,
          <year>2020</year>
          , pp.
          <volume>10</volume>
          <fpage>173</fpage>
          -
          <lpage>10</lpage>
          183. doi:
          <volume>10</volume>
          .1109/CVPR42600.
          <year>2020</year>
          .
          <volume>01019</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Smedsrud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          , et al., “
          <article-title>Kvasir-VQA: A text-image pair GI tract dataset,” Medical Image Analysis</article-title>
          , vol.
          <volume>76</volume>
          , p.
          <fpage>102</fpage>
          <lpage>318</lpage>
          ,
          <year>2022</year>
          , issn:
          <fpage>1361</fpage>
          -
          <lpage>8415</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.media.
          <year>2021</year>
          .
          <volume>102318</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Frid-Adar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Diamant</surname>
          </string-name>
          , E. Klang,
          <string-name>
            <given-names>M.</given-names>
            <surname>Amitai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Goldberger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Greenspan</surname>
          </string-name>
          , “
          <article-title>GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification</article-title>
          ,
          <source>” Neurocomputing</source>
          , vol.
          <volume>321</volume>
          , pp.
          <fpage>321</fpage>
          -
          <lpage>331</lpage>
          ,
          <year>2018</year>
          , issn:
          <fpage>0925</fpage>
          -
          <lpage>2312</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.neucom.
          <year>2018</year>
          .
          <volume>09</volume>
          .013.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kazerouni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. K.</given-names>
            <surname>Aghdam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heidari</surname>
          </string-name>
          , et al., “
           <article-title>Diffusion models for medical image analysis: A comprehensive survey,” Medical Image Analysis</article-title>
          , vol.
          <volume>88</volume>
          , p.
          <fpage>102</fpage>
          <lpage>846</lpage>
          ,
          <year>2023</year>
          , issn:
          <fpage>1361</fpage>
          -
          <lpage>8415</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.media.
          <year>2023</year>
          .
          <volume>102846</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>Y.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Yuan</surname></string-name>, and
          <string-name><given-names>D.</given-names> <surname>Ni</surname></string-name>,
          “<article-title>PromptToPolyp: Polyp Segmentation with Text Prompts</article-title>,”
          <source>in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine</source>, IEEE,
          <year>2023</year>, pp.
          <fpage>1285</fpage>-<lpage>1292</lpage>. doi: 10.1109/BIBM58861.2023.10385651.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>O. O.</given-names> <surname>Ejiga Peter</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Emakporuena</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Tunde</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Abdulkarim</surname></string-name>, and
          <string-name><given-names>A.</given-names> <surname>Umar</surname></string-name>,
          “<article-title>Transformer-based explainable deep learning for breast cancer detection in mammography: The MammoFormer framework</article-title>,”
          <source>American Journal of Computer Science and Technology</source>, vol.
          <volume>8</volume>, pp.
          <fpage>121</fpage>-<lpage>137</lpage>,
          <year>2025</year>. doi: 10.11648/j.ajcst.20250802.16.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>A.</given-names> <surname>Radford</surname></string-name>,
          <string-name><given-names>J. W.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Hallacy</surname></string-name>, et al.,
          “<article-title>Learning transferable visual models from natural language supervision</article-title>,”
          <source>in Proceedings of the International Conference on Machine Learning, PMLR</source>,
          <year>2021</year>, pp.
          <fpage>8748</fpage>-<lpage>8763</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>J.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Xiong</surname></string-name>, and
          <string-name><given-names>S.</given-names> <surname>Hoi</surname></string-name>,
          “<article-title>BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation</article-title>,”
          <source>in Proceedings of the International Conference on Machine Learning, PMLR</source>,
          <year>2022</year>, pp.
          <fpage>12888</fpage>-<lpage>12900</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>J.-B.</given-names> <surname>Alayrac</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Donahue</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Luc</surname></string-name>, et al.,
          “<article-title>Flamingo: a Visual Language Model for Few-Shot Learning</article-title>,”
          <source>Advances in Neural Information Processing Systems</source>, vol.
          <volume>35</volume>, pp.
          <fpage>23716</fpage>-<lpage>23736</lpage>,
          <year>2022</year>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>M.</given-names> <surname>Hoque</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Hasan</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Emon</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Khalifa</surname></string-name>, and
          <string-name><given-names>M.</given-names> <surname>Rahman</surname></string-name>,
          “<article-title>Medical image interpretation with large multimodal models</article-title>,” notebook for the cs 2024,
          <year>2024</year>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><given-names>E. J.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Shen</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Wallis</surname></string-name>, et al.,
          “<article-title>LoRA: Low-Rank Adaptation of Large Language Models</article-title>,”
          <source>in Proceedings of the International Conference on Learning Representations, OpenReview.net</source>,
          <year>2022</year>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><given-names>J.</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Ma</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Berg-Kirkpatrick</surname></string-name>, and
          <string-name><given-names>G.</given-names> <surname>Neubig</surname></string-name>,
          “<article-title>Towards a Unified View of Parameter-Efficient Transfer Learning</article-title>,”
          <source>Proceedings of the International Conference on Learning Representations</source>,
          <year>2022</year>.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name><given-names>S.</given-names> <surname>Gautam</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Storås</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Midoglu</surname></string-name>, et al.,
          “<article-title>Kvasir-VQA: A text-image pair GI tract dataset</article-title>,”
          <source>in Proceedings of the First International Workshop on Vision-Language Models for Biomedical Applications (VLM4Bio '24)</source>, Melbourne, VIC, Australia: ACM,
          <year>2024</year>, 10 pages. doi: 10.1145/3689096.3689458.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>