<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal AI for Gastrointestinal Diagnostics: Tackling VQA in MEDVQA-GI 2025</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sujata Gaihre</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amir Thapa Magar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prasuna Pokharel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laxmi Tiwari</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fusemachine</institution>
          ,
          <country country="NP">Nepal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Logictronix Technologies</institution>
          ,
          <country country="NP">Nepal</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our approach to Subtask 1 of the ImageCLEFmed MEDVQA 2025 Challenge, which targets visual question answering (VQA) for gastrointestinal endoscopy. We adopt Florence-2, a large-scale multimodal foundation model, as the backbone of our VQA pipeline, pairing a powerful vision encoder with a text encoder to interpret endoscopic images and produce clinically relevant answers. To improve generalization, we apply domain-specific augmentations that preserve medical features while increasing training diversity. Experiments on the Kvasir-VQA dataset show that fine-tuning Florence-2 yields accurate responses on the official challenge metrics. Our results highlight the potential of large multimodal models in medical VQA and provide a strong baseline for future work on explainability, robustness, and clinical integration. The code is publicly available at: github.com/TiwariLaxuu/VQA-Florence.git.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical VQA</kwd>
        <kwd>ImageCLEFmed 2025</kwd>
        <kwd>Multimodal AI</kwd>
        <kwd>Clinical Question Answering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        VQA has evolved significantly over the past decade, with growing emphasis on reducing dataset biases
and ensuring visual grounding in answers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Early VQA models demonstrated high performance by
exploiting statistical patterns in the questions rather than truly interpreting the content of the image,
a critical flaw when applying such models to sensitive domains such as medical diagnostics.
      </p>
      <p>Medical Visual Question Answering (Med-VQA) has rapidly evolved with advances in both computer
vision and natural language processing. Traditional Med-VQA approaches include Modality-Ensemble
Visual Features (MEVF), which integrate visual cues across modalities, combined with Bilinear
Attention Networks (BAN), a technique for modeling image-question interactions using low-rank bilinear
pooling. Conditional Reasoning (CR) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], as well as Contrastive Pretraining and Representation
Distillation (CPRD) with BAN [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], treated the problem as a classification task, relying heavily on predefined
answer sets. These methods struggled with open-ended questions due to limited integration of external
medical knowledge and semantic reasoning.
      </p>
      <p>
        More recent work has explored visual language pretraining. PubMedCLIP [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] leveraged contrastive
learning on medical text-image pairs, while the Multi-modal Masked Autoencoder (M3AE) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
used masked modeling for joint vision-language alignment. The state-of-the-art Multi-modal Concept
Alignment Pre-training (MMCAP) approach [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] uses a knowledge graph derived from the Unified Medical Language System (UMLS) together
with image-caption datasets. By aligning visual and textual data through a transformer-based
encoder-decoder framework enriched with external medical knowledge, MMCAP achieved top performance on both
the Semantically-Labeled Knowledge-Enhanced dataset (SLAKE) [9] and a radiology VQA dataset [10].
      </p>
      <p>A pivotal work addressing this issue was proposed by Goyal et al. [11] in "Making the V in VQA
Matter". The authors identified that models trained on the existing VQA dataset [ 12] often relied heavily
on language priors. For example, questions like "Is there a clock?" could be correctly answered without
analyzing the image due to dataset bias. To address this, they introduced VQA v2.0 [11], a balanced
dataset that pairs each question with two visually similar images requiring different answers. This
structure significantly reduced reliance on question-only cues, forcing models to ground their answers
in visual content. Their findings showed a noticeable performance drop in models previously successful
on original VQA, confirming the overreliance on language. They also proposed a counter-example
retrieval method as a basic form of model interpretability. These design principles—bias mitigation,
dataset balancing, and explainability—are highly relevant to medical VQA, where clinical safety depends
on faithful visual reasoning.</p>
      <p>Building on these foundational ideas, Gautam et al. [13] introduced Kvasir-VQA, a domain-specific
VQA dataset for GI endoscopy. Derived from the HyperKvasir dataset, Kvasir-VQA consists of
6,500 annotated images with associated question-answer pairs, forming image-question-answer (IQA)
triplets whose questions are closely aligned to real clinical
scenarios. Unlike generic VQA datasets, Kvasir-VQA emphasizes clinically significant questions across
diverse GI conditions, procedures, and anatomical regions. This design enables models to learn nuanced
multimodal patterns critical for accurate diagnostic reasoning. To address the scarcity of annotated
medical images, the dataset integrates expert-verified questions spanning identification, localization,
and reasoning tasks—thereby advancing medical AI benchmarks in the GI domain.</p>
      <p>While earlier approaches on Kvasir-VQA have utilized baseline multimodal models with standard
data augmentation, our approach for MEDVQA 2025 Task 1 builds upon and extends this line of
work. We adopt Florence, a large-scale multimodal transformer known for robust visual-language
alignment, and introduce domain-specific image augmentations tailored for endoscopic imagery. These
augmentations preserve critical visual features (e.g., mucosal texture, bleeding points) while simulating
real-world variability. Additionally, we employ a generative decoding strategy that enables our model
to produce clinically precise, open-ended answers, in contrast to classification-based systems that limit
expressiveness. These design choices yield competitive performance on the Kvasir-VQA benchmark
and represent a promising step toward trustworthy medical VQA systems.</p>
      <p>Guo et al. [14] tackled the often-overlooked problem of unanswerable questions in VQA. In real-world
applications, including clinical and scientific settings, VQA systems frequently encounter questions
that cannot be answered given the provided visual input. However, most existing VQA benchmarks fail
to account for these scenarios, leading models to produce confident yet incorrect answers that can be
misleading or even harmful.</p>
      <p>To address this, the authors introduced unknown visual question answering (UNK-VQA), a dataset
specifically designed to evaluate a model’s ability to recognize when a question is unanswerable. They
constructed this dataset by systematically modifying answerable questions from standard VQA datasets
using perturbation techniques such as word replacement, semantic negation, and object substitution.
These modifications generated challenging questions that remained linguistically coherent but lacked
sufficient visual evidence to answer correctly. By combining both answerable and unanswerable
questions, UNK-VQA provides a rigorous test of a model’s abstention capabilities.</p>
      <p>Guo et al. conducted extensive evaluations using several state-of-the-art vision-language models,
including BLIP [15], LLaVA [16], and GPT-4V [14]. Despite performing well on traditional VQA
benchmarks, these models often failed on UNK-VQA, frequently producing overconfident yet incorrect
answers. This finding highlights a significant limitation in current VQA architectures: the inability to
reliably abstain from answering when faced with insufficient visual information.</p>
      <p>Overall, UNK-VQA offers a valuable resource for the community to evaluate and improve VQA
models’ ability to handle uncertainty, a critical requirement for deploying such systems in sensitive
domains where incorrect answers can have serious consequences.</p>
      <p>In summary, by combining principles from general VQA (e.g., dataset balancing and grounding
from Goyal et al.) with domain-specific insights from Kvasir-VQA, our method addresses the unique
challenges of medical VQA—ensuring accuracy, interpretability, and clinical relevance in gastrointestinal
diagnostics.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Description</title>
      <p>In this study, we participated in Subtask 1: Visual Question Answering (VQA) of the
ImageCLEFmed MEDVQA-GI 2025 challenge [17]. The objective of this subtask is to develop intelligent systems capable
of automatically answering clinically relevant questions based on GI images. This task is especially
important in the medical field, where accurate image-based question answering can support clinical
diagnosis, documentation, and education.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset Description</title>
        <p>We used the Kvasir-VQA dataset [13] for developing and evaluating our visual question answering
(VQA) models. This multimodal dataset is derived from the extended HyperKvasir image repository
and comprises 58,849 image–question–answer (IQA) triplets associated with 6,500
high-resolution gastrointestinal (GI) endoscopy images.</p>
        <p>Each image in the dataset is linked to multiple QA pairs and is annotated with detailed clinical
context. Specifically, each IQA sample includes:</p>
        <p>• Image: A gastrointestinal (GI) endoscopic image.</p>
        <p>• Source: The clinical label associated with the image, selected from six predefined categories.</p>
        <p>• Question: A natural language query pertaining to the image, focusing on diagnostic, anatomical, or procedural aspects.</p>
        <p>• Answer: A concise response that directly addresses the question.</p>
        <p>For computational efficiency, we utilized a 1% stratified subset of the full dataset. This subset was
first divided into 90% training and 10% held-out test data; the training portion was then further split
90/10 into training and validation sets, yielding the final train/validation/test split. This ensured a
balanced and representative subset while allowing efficient fine-tuning and evaluation of our models.</p>
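        <p>As an illustrative sketch of this procedure (the DataFrame and the “source” label column are assumptions for illustration, not our exact pipeline), the splits could be produced with scikit-learn as follows:</p>
        <preformat>
# Minimal sketch of the stratified subsetting and splitting described above.
import pandas as pd
from sklearn.model_selection import train_test_split

def make_splits(df: pd.DataFrame, label_col: str = "source", seed: int = 42):
    # Draw a 1% subset, stratified on the clinical source label.
    subset, _ = train_test_split(df, train_size=0.01,
                                 stratify=df[label_col], random_state=seed)
    # 90% train / 10% held-out test, preserving label proportions.
    train, test = train_test_split(subset, test_size=0.10,
                                   stratify=subset[label_col], random_state=seed)
    # Carve a 10% validation set out of the training portion.
    train, val = train_test_split(train, test_size=0.10,
                                  stratify=train[label_col], random_state=seed)
    return train, val, test
        </preformat>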
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Exploratory Data Analysis (EDA)</title>
        <p>To better understand the dataset, we conducted a preliminary exploratory data analysis [18] focusing
on the structure and distribution of samples.</p>
        <p>We visualized sample data to inspect the quality and diversity of the visual-question-answer triplets.
Figure 1 shows examples of endoscopic images with their corresponding clinical questions and answers.
These samples reflect a broad spectrum of question types, including disease identification, anatomical
assessment, and procedural inquiry.</p>
        <p>Figure 1: Example image–question–answer triplets from the Kvasir-VQA dataset: (a) triplet 1; (b) triplet 2.</p>
        <p>In addition to qualitative inspection, we quantitatively analyzed the distribution of answers across
the dataset. We found that the dataset contains a mix of common and rare answers. Figure 2a illustrates
the top 10 most frequent answers. Short and generic responses such as none, no, yes, and 0 dominate
the distribution. This highlights a significant class imbalance, where frequently occurring answers may
bias the model if not handled properly during training. Some clinically specific answers like colonoscopy
and polyp are also present, but less frequent.</p>
        <p>Figure 2b shows the distribution of answer lengths measured by the number of words. The majority
of answers consist of a single word, with the frequency dropping sharply for longer responses. This
indicates that most answers in the dataset are concise and classification-like, rather than descriptive.
However, the presence of multi-word answers suggests the need for the model to also handle more
complex, free-form responses.</p>
        <p>In total, the dataset comprises 58,849 image-question-answer (IQA) samples, based on 20 unique
question templates and 502 unique answers. This demonstrates the diversity and complexity of the
task, requiring models capable of handling high class imbalance, short-form predictions, and clinically
rich semantics across a broad answer space.</p>
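        <p>A short pandas sketch of this analysis is shown below; the CSV file name and column names are illustrative assumptions:</p>
        <preformat>
# Sketch of the answer-distribution analysis behind Figure 2.
import pandas as pd

df = pd.read_csv("kvasir_vqa_triplets.csv")  # hypothetical export of the IQA triplets

top_answers = df["answer"].value_counts().head(10)   # Figure 2a: top-10 answers
answer_lengths = df["answer"].str.split().str.len()  # Figure 2b: length in words

print(top_answers)
print(answer_lengths.value_counts().sort_index())
print("unique answers:", df["answer"].nunique())     # 502 in the full dataset
        </preformat>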
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>To address the challenge of answering clinically relevant questions from gastrointestinal images, we
adopt Florence-2—a unified vision foundation model [19]—as the backbone of our VQA pipeline.</p>
      <sec id="sec-4-1">
        <title>4.1. Base Model Overview</title>
        <p>Florence-2 supports a wide range of computer vision and vision-language tasks, including image
captioning, object detection, referring segmentation [20], and Visual Question Answering (VQA),
using a unified architecture and shared weights. It formulates all tasks within a sequence-to-sequence
framework: for VQA, the model takes an image and a task-specific prompt (the question) and generates
a free-text answer. This prompt-based approach enables consistent inference across modalities and
tasks [21].</p>
        <p>Florence-2 captures both semantic and spatial detail, which is important in medical VQA, where
global (e.g., anatomical site) and local features (e.g., mucosal patterns) inform clinical reasoning. While
the model supports spatial grounding through location tokens, these were not utilized during fine-tuning
due to the lack of region annotations in our dataset.</p>
        <p>The unified, prompt-driven design of Florence-2 offers:</p>
        <p>• Flexible answer generation: Enables detailed, free-form clinical responses beyond classification.</p>
        <p>• Interpretability: Spatial grounding via location tokens enhances transparency.</p>
        <p>• Scalability: Pretraining on FLD-5B (5.4B annotations on 126M images) supports generalization to data-scarce domains.</p>
        <p>These capabilities make Florence-2 particularly suitable for medical VQA tasks such as those in the
MEDVQA-GI 2025 challenge.</p>
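        <p>For illustration, a minimal inference sketch with the HuggingFace implementation of Florence-2 is shown below; passing the raw question as the prompt mirrors our VQA setup and is an assumption, not an official Florence-2 task token, and the image path is hypothetical:</p>
        <preformat>
# Illustrative prompt-based VQA inference with Florence-2 via Transformers.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-base-ft"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("endoscopy_frame.jpg").convert("RGB")
prompt = "Is there a polyp in the image?"  # the question serves as the task prompt

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(input_ids=inputs["input_ids"],
                               pixel_values=inputs["pixel_values"],
                               max_new_tokens=32)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
        </preformat>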
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model Architecture</title>
        <p>Florence-2 adopts a modular architecture that integrates a vision encoder and a multi-modal
encoder-decoder to align image and text representations. The vision encoder is based on DaViT (Dual Attention
Vision Transformer), utilizing a frozen ViT-L/14 backbone pretrained at 896 × 896 resolution with a
patch size of 16 × 16, resulting in 196 visual tokens per image. These are mapped into a 196 × 1024 feature
map, maintaining spatial information via learned 2D positional embeddings. All parameters of the
vision encoder are frozen during fine-tuning to preserve the general-purpose representations from
pretraining.</p>
        <p>The multi-modal encoder-decoder module fuses visual and textual inputs. Textual prompts are
tokenized and embedded, while visual tokens are projected and normalized to match the text embedding
space. The concatenated sequence is then processed through a multi-modal encoder to learn joint
representations.</p>
        <p>For decoding, Florence-2 employs a 2.7B parameter causal language model with 32 transformer layers
and 32 attention heads, each with a hidden size of 2048. Cross-attention layers are added every fourth
layer to incorporate visual context into the text generation process. During inference, the decoder
generates answers step-by-step using temperature sampling (T = 0.7), with its attention mechanism
using hidden states as queries and the projected 196 × 256 image features as key-value pairs.
        <p>The model is trained using a standard cross-entropy loss objective:</p>
        <p>ℒ(θ) = − ∑_{t=1}^{|y|} log p_θ(y_t | y_{&lt;t}, x)    (1)</p>
        <p>where θ denotes the model parameters, x is the combined input (image and question), and y = (y_1, …, y_{|y|}) is the target
answer sequence.</p>
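        <p>A compact PyTorch sketch of this objective (tensor shapes are illustrative) is:</p>
        <preformat>
# Token-level cross-entropy matching Eq. (1); padded label positions are set
# to -100 so they are ignored (see Section 4.4).
import torch.nn.functional as F

def causal_lm_loss(logits, labels):
    # logits: (batch, seq_len, vocab); labels: (batch, seq_len), padding = -100
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1),
                           ignore_index=-100)
        </preformat>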
      </sec>
      <sec id="sec-4-3">
        <title>4.3. VQA Adaptability and Fine-Tuning</title>
        <p>Florence-2 performs well in both zero-shot and fine-tuned scenarios:</p>
        <p>• Zero-shot generalization: Exhibits robust performance without VQA-specific training [22].</p>
        <p>• Fine-tuning transferability: Performance improves significantly when adapted to VQA datasets like DocVQA.</p>
        <p>• Unified modeling: Its prompt-based, task-agnostic formulation eliminates the need for task-specific heads, improving generalizability.</p>
        <p>• Parameter efficiency: With 0.23B (base) and 0.77B (large) parameters, Florence-2 achieves competitive VQA performance after fine-tuning.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Fine-Tuning Setup</title>
        <p>We fine-tuned the microsoft/Florence-2-base-ft checkpoint, keeping the ViT-L/14 vision tower
frozen. Inputs (images and questions) were processed using the model’s AutoProcessor. Answers
were tokenized with padding replaced by -100 for causal loss computation. No adapter-based methods
(e.g., LoRA) were used. Evaluation employed BLEU, METEOR, and ROUGE-L using the HuggingFace
evaluate library.</p>
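        <p>A minimal sketch of this batch preparation, assuming dataset fields named question, image, and answer, is:</p>
        <preformat>
# Sketch of the input/label preparation described above: the AutoProcessor
# encodes images and questions, answers are tokenized separately, and pad
# tokens in the labels are replaced by -100 for the causal loss.
def collate_fn(batch, processor):
    questions = [ex["question"] for ex in batch]
    images = [ex["image"] for ex in batch]
    answers = [ex["answer"] for ex in batch]

    inputs = processor(text=questions, images=images,
                       return_tensors="pt", padding=True)
    labels = processor.tokenizer(answers, return_tensors="pt",
                                 padding=True).input_ids
    labels[labels == processor.tokenizer.pad_token_id] = -100  # mask padding
    inputs["labels"] = labels
    return inputs
        </preformat>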
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Training Protocol</title>
        <p>Training was conducted using AdamW (β1 = 0.9, β2 = 0.999), a learning rate of 7.8 × 10⁻⁶ (cosine
decay over 20 epochs), and weight decay of 0.1. The effective batch size was 20 (5 per GPU with 4
gradient accumulation steps). Regularization included gradient clipping (max norm = 1.0) and dropout
(p = 0.1) on attention weights. Evaluation was done after each epoch, with early stopping after 3
epochs without improvement. Baselines were compared using paired t-tests (p &lt; 0.05).</p>
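        <p>The following sketch illustrates this protocol; model, loader, and steps_per_epoch are assumed to exist, and this is not our exact training script:</p>
        <preformat>
# Optimizer/schedule sketch matching Section 4.5 (AdamW, cosine decay,
# gradient accumulation, clipping).
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=7.8e-6,
                              betas=(0.9, 0.999), weight_decay=0.1)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=0,
                                            num_training_steps=steps_per_epoch * 20)

accum = 4  # 5 samples per GPU x 4 accumulation steps = effective batch of 20
for step, batch in enumerate(loader):
    loss = model(**batch).loss / accum
    loss.backward()
    if (step + 1) % accum == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        </preformat>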
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Implementation Details</title>
        <p>Experiments were implemented in PyTorch with CUDA and HuggingFace Transformers. Training used
NVIDIA T4 GPUs (16GB), mixed precision (FP16 for matrix ops, FP32 for loss), and a 72-hour time
budget over 10 epochs. Experiment tracking was done using Weights &amp; Biases, with data/versioning
managed by DVC. Reproducibility was ensured through fixed random seeds (42 for data, 3407 for model)
and deterministic algorithms.</p>
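        <p>A minimal sketch of the seeding setup is:</p>
        <preformat>
# Reproducibility setup matching the seeds reported above (42 for data,
# 3407 for the model) and deterministic execution.
import random
import numpy as np
import torch

random.seed(42)
np.random.seed(42)                        # data-side seed
torch.manual_seed(3407)                   # model-side seed
torch.use_deterministic_algorithms(True)  # deterministic kernels where available
        </preformat>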
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Evaluation</title>
      <p>In this section, we present the results of our experiments, the evaluation methodology employed, and
key insights derived from the analysis. We evaluated the VQA performance using standard NLP metrics
including BLEU, ROUGE-1, ROUGE-2, ROUGE-L, and METEOR.</p>
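      <p>An illustrative metric-computation sketch with the HuggingFace evaluate library (the prediction/reference pair is a toy example) is:</p>
      <preformat>
# Metric computation with the `evaluate` library, as used in our evaluation.
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

preds = ["there is a polyp"]
refs = ["there is one polyp"]

print(bleu.compute(predictions=preds, references=[[r] for r in refs]))
print(rouge.compute(predictions=preds, references=refs))  # rouge1/rouge2/rougeL
print(meteor.compute(predictions=preds, references=refs))
      </preformat>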
      <p>The results reported in Table 1 were obtained after fine-tuning the Florence-2 model
on a 1% stratified subset of the Kvasir-VQA dataset. The model was fine-tuned using domain-specific
augmentations and a causal language modeling loss. The vision backbone (ViT-L/14) was frozen during
training to preserve pretrained visual features, while the language decoder was trained with the AdamW
optimizer (learning rate: 7.8 × 10⁻⁶, weight decay: 0.1). The training process used a batch size of 5 with
gradient checkpointing enabled, and evaluations were performed at each epoch using BLEU, METEOR,
and ROUGE-L as primary metrics.</p>
      <p>Table 1 summarizes the VQA performance of our fine-tuned model across diferent evaluation stages.
On the validation set, the model achieved a BLEU score of 0.12, ROUGE-1 of 0.78, ROUGE-2 of 0.09,
ROUGE-L of 0.77, and a METEOR score of 0.42. Public test results showed improved performance with
a BLEU score of 0.150 and a METEOR score of 0.440. The model performed best on the private test set
with BLEU 0.160, ROUGE-L 0.880, and METEOR 0.490.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Ablation Studies</title>
      <p>In our ablation study, we evaluated four augmentation strategies—no augmentation, heavy, standard,
and fine-tuned—using the Florence-2 model for medical VQA; Table 2 compares these strategies and
their impact on VQA performance. The baseline (no augmentation) produced low scores (BLEU: 0.00,
ROUGE-L: 0.63, METEOR: 0.31), while heavy augmentation further degraded performance due to
unrealistic distortions (e.g., vertical flip), with ROUGE-L and METEOR dropping to 0.48 and 0.25,
respectively. In contrast, standard augmentation (random crop, flip, color jitter) improved scores (BLEU:
0.12, ROUGE-L: 0.77, METEOR: 0.42). The best results were achieved with fine-tuned augmentation
(BLEU: 0.15, ROUGE-L: 0.80, METEOR: 0.44), confirming that carefully tuned, domain-aware
transformations enhance model generalization and robustness. Here, standard augmentation includes
random crop, flip, and color jitter, while heavy augmentation includes strong distortions such as vertical
flip and random rotation.</p>
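      <p>For illustration, the standard and heavy pipelines could look as follows in torchvision; all parameter magnitudes are assumptions, as the exact values are not reported here:</p>
      <preformat>
# Illustrative torchvision pipelines for the "standard" and "heavy"
# augmentation strategies compared in Table 2.
from torchvision import transforms

standard_aug = transforms.Compose([
    transforms.RandomResizedCrop(768, scale=(0.8, 1.0)),   # gentle random crop
    transforms.RandomHorizontalFlip(),                     # flip
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # mild color jitter
])

heavy_aug = transforms.Compose([
    transforms.RandomVerticalFlip(p=0.5),   # anatomically implausible for endoscopy
    transforms.RandomRotation(degrees=90),  # strong rotation
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5),
])
      </preformat>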
      <sec id="sec-6-1">
        <title>6.1. Performance by Question Type</title>
        <p>To better understand model behavior [23], we conducted a fine-grained evaluation based on the type
of question (e.g., what, where, how). Table 3 reports BLEU, ROUGE-L, and METEOR scores computed
separately for each first-word question type in the validation set. This analysis highlights where the
model performs well (e.g., “where” and “have” questions) and where it struggles (e.g., “how” and “is”),
which helps us understand where improvements are needed.</p>
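        <p>A sketch of this breakdown, assuming aligned lists of questions, predictions, and references, is:</p>
        <preformat>
# Per-question-type breakdown as in Table 3: group validation examples by the
# first word of the question and score each group.
import evaluate
import pandas as pd

meteor = evaluate.load("meteor")
df = pd.DataFrame({"question": questions, "pred": preds, "ref": refs})
df["qtype"] = df["question"].str.lower().str.split().str[0]  # "what", "where", ...

for qtype, grp in df.groupby("qtype"):
    score = meteor.compute(predictions=grp["pred"].tolist(),
                           references=grp["ref"].tolist())
    print(qtype, round(score["meteor"], 2))
        </preformat>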
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <p>Our results show that Florence-2, when fine-tuned with clinically informed augmentations, offers a
promising baseline for medical VQA in gastrointestinal endoscopy. By freezing the ViT-L/14 vision
backbone, we retained robust pretrained features while adapting the decoder to align with
domain-specific linguistic patterns.</p>
      <p>Among the augmentation strategies evaluated, fine-tuned augmentations provided consistent
improvements over both the baseline (BLEU: 0.00, METEOR: 0.31) and heavy augmentations (METEOR:
0.25), achieving the best scores (BLEU: 0.15, METEOR: 0.44). This confirms that medically plausible
transformations help preserve critical visual cues necessary for accurate predictions.</p>
      <p>Analysis by question type revealed stronger performance on spatial and binary questions—such as
those starting with “where” (METEOR: 0.58) and “have” (0.77)—while procedural and abstract questions
like “how” proved more difficult (METEOR: 0.37). Additionally, low BLEU scores across categories
suggest the model often captures the semantic intent but deviates from exact phrasing—a known
limitation of generative models evaluated with n-gram-based metrics.</p>
      <p>Finally, performance gains from validation to private test splits (e.g., METEOR: 0.42 → 0.49) suggest
some generalization, although results remain moderate overall. These findings highlight both the
potential and limitations of large multimodal models in specialized clinical VQA tasks, particularly
under constrained data conditions.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion and Future Work</title>
      <p>This study explored the use of Florence-2, a large-scale multimodal foundation model, for visual question
answering in gastrointestinal endoscopy. The model was adapted using a frozen ViT-L/14 vision encoder
and fine-tuned multimodal layers, showing that even with limited data, clinically meaningful responses
can be generated when supported by realistic, domain-specific augmentations.</p>
      <p>Fine-tuned augmentation strategies led to notable improvements in performance, with METEOR
scores increasing from 0.31 (no augmentation) to 0.44, and up to 0.49 on the private test set. Performance
varied by question type—stronger on spatial and binary queries (e.g., “where,” “have”) and weaker on
procedural or abstract ones (e.g., “how”)—indicating strengths in visual pattern recognition but
persistent difficulty with complex clinical reasoning.</p>
      <p>Several directions appear promising for improving clinical applicability: enhancing model
interpretability through visual grounding [24], incorporating uncertainty handling for unanswerable cases,
enriching semantic reasoning via medical knowledge integration, and extending to multi-turn,
conversational scenarios. These steps can improve the system’s reliability, transparency, and alignment with
real-world clinical workflows.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors utilized generative AI tools. Specifically, Grammarly
was used for proofreading and improving grammar, while ChatGPT was used to enhance the clarity
and readability of sentences. The authors reviewed and edited all content to ensure its accuracy and
take full responsibility for the final manuscript.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>We are grateful to the Kvasir-VQA dataset providers and ImageCLEFmed MEDVQA 2025 organizers
for their essential resources. Our gratitude also extends to the broader open-source community for
their support. We thank Logictronix Technologies for providing computing resources. Due to limited
infrastructure access in an LMIC setting, we used a small subset of the dataset but hope our work
contributes to more inclusive, resource-aware research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <article-title>Bridging Multimedia Modalities: Enhanced Multimodal AI Understanding and Intelligent Agents</article-title>
          , in: ACM Conferences, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , pp.
          <fpage>695</fpage>
          -
          <lpage>699</lpage>
          . doi:10.1145/3577190.3614225.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <article-title>Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy</article-title>
          , arXiv (
          <year>2025</year>
          ). doi:10.48550/arXiv.2506.09958. arXiv:2506.09958.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. Y.</given-names>
            <surname>Zakari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Owusu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. K.</given-names>
            <surname>Lawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <article-title>Vqa and visual reasoning: An overview of recent datasets, methods and challenges</article-title>
          ,
          <source>arXiv preprint arXiv:2212.13296</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.-M.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Medical visual question answering via conditional reasoning</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Multimedia</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2345</fpage>
          -
          <lpage>2354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-M.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Contrastive pre-training and representation distillation for medical visual question answering based on radiology images</article-title>
          , in: M. de Bruijne,
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Cattin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cotin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Padoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Speidel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          , C. Essert (Eds.),
          <source>Medical Image Computing and Computer Assisted Intervention - MICCAI 2021</source>
          , Springer International Publishing, Cham,
          <year>2021</year>
          , pp.
          <fpage>210</fpage>
          -
          <lpage>220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Eslami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meinel</surname>
          </string-name>
          , G. De Melo,
          <article-title>Pubmedclip: How much does clip benefit visual question answering in the medical domain?</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EACL</source>
          <year>2023</year>
          ,
          <year>2023</year>
          , pp.
          <fpage>1181</fpage>
          -
          <lpage>1193</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-H.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Mapping medical image-text to a joint space via masked modeling</article-title>
          ,
          <source>Medical Image Analysis</source>
          <volume>91</volume>
          (
          <year>2024</year>
          )
          <fpage>103018</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1361841523002785. doi:10.1016/j.media.2023.103018.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Multi-modal concept alignment pre-training for generative medical visual question answering</article-title>
          , in: L.-W. Ku, A. Martins, V. Srikumar (Eds.),
          <source>Findings of the Association for Computational Linguistics: ACL 2024</source>
          , Association for Computational Linguistics, Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>5378</fpage>
          -
          <lpage>5389</lpage>
          . URL: https://aclanthology.org/2024.findings-acl.319/. doi:10.18653/v1/2024.findings-acl.319.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] B. Liu, L.-M. Zhan, L. Xu, L. Ma, Y. Yang, X.-M. Wu, SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering, in: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), IEEE, 2021, pp. 1650–1654.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. J. Lau, S. Gayen, A. Ben Abacha, D. Demner-Fushman, A dataset of clinically generated visual questions and answers about radiology images, Scientific Data 5 (2018) 1–10.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6904–6913.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh, VQA: Visual question answering, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2425–2433.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] S. Gautam, A. Storås, C. Midoglu, S. A. Hicks, V. Thambawita, P. Halvorsen, M. A. Riegler, Kvasir-VQA: A text-image pair GI tract dataset, in: Proceedings of the First International Workshop on Vision-Language Models for Biomedical Applications (VLM4Bio '24), ACM, 2024. doi:10.1145/3689096.3689458.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Y. Guo, F. Jiao, Z. Shen, L. Nie, M. Kankanhalli, UNK-VQA: A dataset and a probe into the abstention ability of multi-modal large models, IEEE Transactions on Pattern Analysis and Machine Intelligence (2024).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] J. Li, D. Li, C. Xiong, S. Hoi, BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, in: International Conference on Machine Learning, PMLR, 2022, pp. 12888–12900.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] H. Liu, C. Li, Y. Li, Y. J. Lee, Improved baselines with visual instruction tuning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26296–26306.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] B. Ionescu, H. Müller, D.-C. Stanciu, A. Idrissi-Yaghir, A. Radzhabov, A. G. S. de Herrera, A. Andrei, A. Storås, A. B. Abacha, B. Bracke, B. Lecouteux, B. Stein, C. Macaire, C. M. Friedrich, C. S. Schmidt, D. Fabre, D. Schwab, D. Dimitrov, E. Esperança-Rodier, G. Constantin, H. Becker, H. Damm, H. Schäfer, I. Rodkin, I. Koychev, J. Kiesel, J. Rückert, J. Malvehy, L.-D. Ştefan, L. Bloch, M. Potthast, M. Heinrich, M. A. Riegler, M. Dogariu, N. Codella, P. H. P. Nakov, R. Brüngel, R. A. Novoa, R. J. Das, S. A. Hicks, S. Gautam, T. M. G. Pakull, V. Thambawita, V. Kovalev, W.-W. Yim, Z. Xie, Overview of ImageCLEF 2025: Multimedia retrieval in medical, social media and content recommendation applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025), Springer Lecture Notes in Computer Science (LNCS), Madrid, Spain, 2025.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] C. Chatfield, Exploratory data analysis, European Journal of Operational Research 23 (1986) 5–13.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, L. Yuan, Florence-2: Advancing a unified representation for a variety of vision tasks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4818–4829.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] H. Ding, C. Liu, S. Wang, X. Jiang, Vision-language transformer and query generation for referring segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16321–16330.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] W. Jin, Y. Cheng, Y. Shen, W. Chen, X. Ren, A good prompt is worth millions of parameters: Low-resource prompt-based learning for vision-language models, arXiv preprint arXiv:2110.08484 (2021).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] J. Xing, J. Liu, J. Wang, L. Sun, X. Chen, X. Gu, Y. Wang, A survey of efficient fine-tuning methods for vision-language models—prompt and adapter, Computers &amp; Graphics 119 (2024) 103885.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] M. Chaichuk, S. Gautam, S. Hicks, E. Tutubalina, Prompt to Polyp: Medical text-conditioned image synthesis with diffusion models, arXiv (2025). doi:10.48550/arXiv.2505.05573.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] E. Hasanpour Zaryabi, L. Moradi, B. Kalantar, N. Ueda, A. A. Halin, Unboxing the black box of attention mechanisms in remote sensing big data using XAI, Remote Sensing 14 (2022) 6254.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>