<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Better Gastrointestinal Diagnosis: Evaluating Vision-Language Models For GI VQA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Omar Adjali</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Paris-Saclay University</institution>
          ,
          <addr-line>Gif-sur-Yvette</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Gastrointestinal (GI) image analysis is critical for early diagnosis and treatment of GI diseases, which remain a leading cause of global morbidity and mortality. Visual Question Answering (VQA) in medical imaging enables interpretable and interactive AI systems to support clinical decision-making. This paper presents our submission to the ImageCLEFmed 2025 MedVQA task, which targets medical VQA on gastrointestinal endoscopic images using the Kvasir-VQA dataset. We evaluate two primary approaches: (1) a multimodal Chain-of-Thought (CoT) reasoning framework that decomposes questions into interpretable reasoning steps, and (2) a simple fine-tuning strategy on large-scale generative models. Extensive experiments across multiple vision-language models, including Qwen2-VL and BLIP2, show that fine-tuning significantly outperforms CoT in both validation and test settings. Our best-performing model achieves a METEOR score of 50 on the test set. We also carry out qualitative and quantitative analyses to examine the strengths and weaknesses of our best-performing approach, and suggest some insights for tackling the most challenging aspects of the Kvasir-VQA task.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical VQA</kwd>
        <kwd>ImageCLEFmed 2025</kwd>
        <kwd>Multimodal AI</kwd>
        <kwd>Clinical Question Answering</kwd>
        <kwd>Synthetic GI Images</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. MedVQA Approaches</title>
        <p>
          MedVQA has gained significant attention as a critical task in biomedical AI, requiring models to
generate accurate textual answers conditioned on medical images. Early MedVQA approaches
addressed tasks with limited annotated data. For example, [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] proposed a framework combining
Convolutional Denoising Auto-Encoders (CDAE) and Model-Agnostic Meta-Learning (MAML) to utilize
both unlabeled data and few-shot learning. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] further introduced a conditional reasoning approach that
adapts reasoning strategies based on question types (e.g., closed- vs. open-ended), which significantly
improved performance on the VQA-RAD dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. To manage the diversity of question types, [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
also proposed CGMVQA, a hybrid model that handles both classification and generative answering
via transformer-based architecture. Further works have employed contrastive learning for better
visual feature extraction in low-data regimes. In particular, [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] proposed a dual approach combining
a reasoning module with a contrastively trained visual encoder. Similarly, [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] fine-tuned CLIP on
PubMed image-text pairs, showing notable improvements over visual-only pretrained models through
the introduction of the PubMedCLIP encoder. More recently, the generative paradigm has gained
interest with the emergence of Large Vision-Language Models (LVLM). [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] introduced PMC-VQA,
a large-scale dataset comprising over 227k image-question-answer pairs, and proposed a generative
model fine-tuned for free-form answering. Similarly, [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] presented LLaVA-Med, trained using a
novel curriculum learning strategy on instruction-following data generated by GPT-4, outperforming
previous supervised approaches in both accuracy and versatility. To improve interpretability,
which is crucial for clinical applications, recent work has leveraged self-reflection reasoning enabled by
large language models (LLMs). For example, [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] proposed MedCoT, which relies on multi-expert
diagnostic collaboration through hierarchical Chain-of-Thought and Mixture of Experts. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] introduced
MedThink, which integrates Medical Decision-Making Rationales (MDMRs) into a generative model to
make the reasoning process transparent and clinically verifiable.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. MedVQA Datasets</title>
        <p>
          The development of robust and clinically relevant Visual Question Answering (VQA) systems for
medicine is heavily dependent on high-quality annotated datasets. Over the past few years, several
notable datasets have emerged, each addressing unique aspects of medical image understanding through
natural language queries. VQA-RAD [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] is the first manually curated medical VQA dataset tailored to
radiology. It comprises over 3,500 natural question-answer (QA) pairs covering 315 unique radiological
images. The questions were authored by clinical trainees with medical imaging experience, ensuring
medical realism. Similarly, [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] introduced PathVQA, the first VQA dataset focused on pathology
including open-ended and yes/no questions. More recently, [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] proposed SLAKE, a large bilingual
dataset that covers more body parts with rich semantic labels annotated by experienced physicians. In
the context of the ImageCLEFmed 2025 MedVQA challenge, [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] proposed the Kvasir-VQA dataset, which
extends the HyperKvasir and Kvasir-Instrument datasets by introducing over 52,000 synthetic
question-answer pairs for 6,500 images across various gastrointestinal findings, including polyps, esophagitis,
and ulcerative colitis. These QA pairs encompass a range of formats such as yes/no, multiple choice,
location, and numerical count, and were validated by medical experts. This dataset targets image
captioning, diagnostic VQA, and synthetic image generation, enabling research in GI tract diagnostics
and fine-grained instrument recognition. It also supports training generative models such as in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] for
image synthesis based on medical prompts. Finally, and most recently, [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] proposed OmniMedVQA, a new
large-scale comprehensive benchmark for evaluating large vision-language models in the medical
domain. It comprises 118,010 real medical images and 127,995 question-answer (QA) pairs, collected
from 73 distinct datasets, spanning 12 imaging modalities (e.g., MRI, CT, X-Ray, Ultrasound) and over
20 human anatomical regions. OmniMedVQA QA pairs are systematically constructed to evaluate five
major medical reasoning capabilities: modality recognition, anatomy identification, disease diagnosis,
lesion grading, and biological attributes.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Task Overview and Dataset</title>
      <sec id="sec-3-1">
        <title>3.1. Task Formulation</title>
        <p>The Medical Visual Question Answering (MedVQA) task aims to develop models that can accurately
answer clinically relevant questions about gastrointestinal (GI) endoscopic images. Leveraging the
Kvasir-VQA dataset, the task combines computer vision and natural language understanding to simulate
expert-level diagnostic reasoning. Formally, given an input medical image i and a natural language
question q associated with the image, the objective is to map the image-question pair to a natural
language answer a that is accurate and contextually grounded in the image.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Kvasir-VQA Dataset</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>In this paper, we explore two approaches to tackle the ImageCLEFmed 2025 MedVQA challenge (task 1).
We first investigate how a multimodal Chain-of-Thought (CoT) system performs on the Kvasir-VQA
dataset. Then, we evaluate a simple fine-tuning strategy using the Kvasir-VQA training dataset and other
medical training data.</p>
      <sec id="sec-4-1">
        <title>4.1. Multimodal CoT</title>
        <p>
          Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to explicitly decompose
complex questions into intermediate reasoning steps [
          <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
          ]. As shown in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], MedVQA queries often
require multi-step inference that combines clinical knowledge with image interpretation; thus, logical
paths can be traced from questions to the final answer. Such a structured decomposition may help
mitigate hallucinations and improve answer generation accuracy. Inspired by [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], we model the
ImageCLEFmed 2025 MedVQA task using a multimodal CoT system. Given a Large Vision Languge
Model (LVLM) and a question-image input pair (q,i), we perform the following inference steps: 1)
We generate a preliminary reasoning rationale R such that: R = LVLM(q, i, P), where P is the prompt
instruction used to generate the rationale R. P is formulated as follows:
        </p>
        <sec id="sec-4-1-1">
          <title>Rationale Instruction Prompt</title>
          <p>You are a Vision Language Model assistant which helps an experienced doctor interpreting
accurately interpreting and answering clinical questions based on gastrointestinal images. Given
the image, provide a reasonable rationale for the question: {QUESTION}. Please proceed with a
step-by-step analysis and provide a rationale.</p>
          <p>Subsequently, R is used to generate the final answer A, such that: A = LVLM(q, i, R). We relied on the
following prompt:</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Answer Generation Prompt</title>
          <p>You are a Vision Language Model assistant which helps an experienced doctor interpreting
accurately interpreting and answering clinical questions based on gastrointestinal images. Given
the image and the rationale: {RATIONALE}, Answer briefly the question: {QUESTION} with
no extra text, rationales or explanation.</p>
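          <p>Since the generated rationale can be ineffective with regard to the ground truth answer <inline-formula><tex-math>a^{*}</tex-math></inline-formula>, we trained
the LVLM to answer Kvasir-VQA questions in order to reduce discrepancies between rationales and
the actual answers. The LVLM is trained with the following cross-entropy loss:</p>
          <p><disp-formula id="eq-1"><label>(1)</label><tex-math>\mathcal{L}_{gen} = - \sum_{n=1}^{N} \log p(a_n^{*} \mid q_n, i_n, R_n)</tex-math></disp-formula></p>
          <p>where <inline-formula><tex-math>a_n^{*}</tex-math></inline-formula> is the ground truth answer, <inline-formula><tex-math>q_n</tex-math></inline-formula> is the question, <inline-formula><tex-math>i_n</tex-math></inline-formula> is the image, and <inline-formula><tex-math>R_n</tex-math></inline-formula> is the rationale of the n-th training example.</p>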
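          <p>The two inference steps described above can be sketched as follows. This is only an illustrative outline, not the implementation used in our experiments: the callable <monospace>lvlm</monospace> and the function <monospace>cot_answer</monospace> are hypothetical names standing in for any vision-language model that maps an image and a text prompt to generated text. The prompt strings are copied verbatim from the two prompts above.</p>
          <preformat><![CDATA[
```python
from typing import Any, Callable, Tuple

# Prompt templates copied verbatim from the two prompts given in the text.
RATIONALE_PROMPT = (
    "You are a Vision Language Model assistant which helps an experienced doctor interpreting "
    "accurately interpreting and answering clinical questions based on gastrointestinal images. "
    "Given the image, provide a reasonable rationale for the question: {QUESTION}. "
    "Please proceed with a step-by-step analysis and provide a rationale."
)
ANSWER_PROMPT = (
    "You are a Vision Language Model assistant which helps an experienced doctor interpreting "
    "accurately interpreting and answering clinical questions based on gastrointestinal images. "
    "Given the image and the rationale: {RATIONALE}, Answer briefly the question: {QUESTION} "
    "with no extra text, rationales or explanation."
)

# Hypothetical interface (an assumption): (image, prompt) -> generated text.
LVLM = Callable[[Any, str], str]


def cot_answer(lvlm: LVLM, image: Any, question: str) -> Tuple[str, str]:
    """Two-step inference: R = LVLM(q, i, P), then A = LVLM(q, i, R)."""
    # Step 1: generate the rationale R from the question and the image.
    rationale = lvlm(image, RATIONALE_PROMPT.replace("{QUESTION}", question))
    # Step 2: generate the final answer A conditioned on the rationale.
    prompt = ANSWER_PROMPT.replace("{RATIONALE}", rationale).replace("{QUESTION}", question)
    answer = lvlm(image, prompt).strip()
    return rationale, answer
```
]]></preformat>
          <p>In this sketch the same callable is used for both steps, whereas in our experiments the rationales are generated by a larger model than the one that produces the answers (see Section 4.3); two different callables could be passed to reflect that.</p>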
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Finetuning strategy</title>
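        <p>In the second approach, we performed answer generation using a generative model denoted G(·) with
parameters Φ. Given a question q and the associated image i, the answer generator G is trained with
the following cross-entropy loss over a batch of N question-image pairs:</p>
        <p><disp-formula id="eq-2"><label>(2)</label><tex-math>\mathcal{L}_{G} = - \sum_{n=1}^{N} \log p_{\Phi}(a_n^{*} \mid q_n, i_n)</tex-math></disp-formula></p>
        <p>where <inline-formula><tex-math>q_n</tex-math></inline-formula> is the n-th question, <inline-formula><tex-math>i_n</tex-math></inline-formula> is the image associated with <inline-formula><tex-math>q_n</tex-math></inline-formula>, <inline-formula><tex-math>a_n^{*}</tex-math></inline-formula> is the ground truth answer string
for <inline-formula><tex-math>(q_n, i_n)</tex-math></inline-formula>, <inline-formula><tex-math>p_{\Phi}(a_n^{*} \mid q_n, i_n)</tex-math></inline-formula> is the probability of generating the correct answer from the text-image pair, and
Φ are the parameters of the multimodal answer generator.</p>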
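        <p>As a small worked illustration of this loss, the sketch below sums the negative log-probabilities that a model assigns to the ground truth answers of a batch. The function name and the probability values are made up for illustration. In practice the probability of an answer string is the product of its token probabilities, so the same quantity is accumulated token by token.</p>
        <preformat><![CDATA[
```python
import math
from typing import Sequence


def generation_loss(answer_probs: Sequence[float]) -> float:
    """Cross-entropy over a batch: minus the sum of log-probabilities
    assigned to each ground truth answer."""
    return -sum(math.log(p) for p in answer_probs)


# Toy batch of N = 3 question-image pairs (illustrative values only):
# probability the model gives to the correct answer of each pair.
toy_probs = [0.9, 0.5, 0.25]
toy_loss = generation_loss(toy_probs)
print(f"loss = {toy_loss:.4f}")
```
]]></preformat>
        <p>A model that assigns probability 1 to every ground truth answer reaches a loss of 0, and the loss grows as the assigned probabilities shrink.</p>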
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Training details</title>
        <p>
          We implemented all our experiments in PyTorch [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] and we relied on the Qwen2-VL-72B-instruct [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]
LVLM for generating reasoning rationales. Afterward, we performed the CoT and fine-tuning training
stages on LVLMs of different sizes and architectures: Qwen2-VL-7B-instruct, Qwen2-VL-32B-instruct [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]
and BLIP2-Flan-T5-XL [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. We trained for 10 epochs using a batch size of 4 and a learning rate of 2e-5
on a single A100 GPU. Throughout all finetuning experiments, the LVLMs are trained using LoRA [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]
for parameter-efficient optimization with the following configurations: {r = 8, lora_alpha = 32,
lora_dropout = 0.1} for BLIP2-Flan-T5-XL and {r = 8, lora_alpha = 16, lora_dropout = 0.05}
for the Qwen models. At inference time, decoding is performed using beam search with 3 beams. Model checkpoint
selection was based on validation METEOR performance.
        </p>
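        <p>For reference, the two adapter configurations can be written down as plain dictionaries. The sketch below assumes that the three reported values are the LoRA rank, scaling factor alpha, and dropout, and the key names follow the common <monospace>r</monospace>, <monospace>lora_alpha</monospace>, <monospace>lora_dropout</monospace> naming. The dictionary and helper function are illustrative, not our training code. The helper computes the standard LoRA scaling alpha / r applied to the low-rank update.</p>
        <preformat><![CDATA[
```python
# LoRA hyperparameters as reported in the text (key names are an assumption).
LORA_CONFIGS = {
    "BLIP2-Flan-T5-XL": {"r": 8, "lora_alpha": 32, "lora_dropout": 0.1},
    "Qwen2-VL": {"r": 8, "lora_alpha": 16, "lora_dropout": 0.05},
}


def lora_scaling(config: dict) -> float:
    """Standard LoRA scaling factor applied to the low-rank update: alpha / r."""
    return config["lora_alpha"] / config["r"]


for name, config in LORA_CONFIGS.items():
    print(name, "scaling =", lora_scaling(config))
```
]]></preformat>
        <p>With the same rank, the larger alpha used for BLIP2-Flan-T5-XL gives its adapter updates twice the scaling of those used for the Qwen models.</p>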
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>We evaluated both Chain-of-Thought (CoT) and fine-tuned (FT) models using BLEU, ROUGE, and
METEOR scores. The Qwen2-VL and BLIP2 model architectures were evaluated for both methodologies.
We additionally assessed our best performing model using the Exact Match metric to perform qualitative
and quantitative analysis.</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th>Question</th><th>Answer</th></tr>
          </thead>
          <tbody>
            <tr><td>Where in the image is the abnormality?</td><td>center; center-left; center-right</td></tr>
            <tr><td>What color is the anatomical landmark? If more than one separate with ;</td><td>red; pink; white</td></tr>
            <tr><td>What is the size of the polyp?</td><td>5–10 mm</td></tr>
            <tr><td>Are there any abnormalities in the image? Check all that are present.</td><td>ulcerative colitis</td></tr>
            <tr><td>What type of procedure is the image taken from?</td><td>gastroscopy</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Table 2 shows that the BLIP2 model fine-tuned on the Kvasir-VQA dataset achieved the best overall
performance on the Kvasir-VQA validation set. Note that, to achieve the best performance on the test set,
we further fine-tuned the BLIP2 model on the training sets of PathVQA [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], VQA-RAD [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], SLAKE [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
and OmniMedVQA [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] datasets, which allowed us to achieve a METEOR score of 50 and a BLEU score of
22. In contrast, while Chain-of-Thought prompting generally enhances interpretability by providing
intermediate reasoning, its practical effectiveness on the Kvasir-VQA dataset seems limited without
additional rationale supervision. We believe that instruction fine-tuning the Qwen2-VL-72B-instruct model
that we used to generate the reasoning rationales on medical-domain data would help provide more
comprehensive (less noisy) rationales and thus alleviate the answer/rationale discrepancies.
      </p>
      <p>Furthermore, the performance gap between BLIP2-Flan-T5-XL and the Qwen2-VL models is worth noting.
Indeed, BLIP2-Flan-T5-XL consistently outperforms Qwen2-VL-7B-instruct regardless of the training
method and has comparable performance with Qwen2-VL-32B-instruct in the CoT setting despite their
difference in model size. Moreover, under our LoRA configuration, which reduces the number of
trainable parameters, BLIP2-Flan-T5-XL has 4.7M, Qwen2-VL-7B has 2.5M, and
Qwen2-VL-32B has 8.3M trainable parameters. BLIP2-Flan-T5-XL thus shows superior capabilities
on the Kvasir-VQA task despite its relatively small size. We believe that encoder-decoder architectures such as
BLIP2 are more suitable for VQA tasks as they encode rich image features before generating
the textual output, facilitating better multimodal alignment, while decoder-only models like Qwen2-VL
must process the image and question together through a single stream, which may limit fine-grained
control over visual and textual token interactions during generation.</p>
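      <p>The trainable parameter counts above do not follow model size because, under LoRA, they depend on the rank and on which weight matrices receive an adapter. A rank-r adapter on a weight matrix of shape d_out × d_in adds r × (d_in + d_out) trainable parameters. The sketch below illustrates this count. The layer dimensions and the number of adapted matrices are hypothetical and do not describe the models used in our experiments.</p>
      <preformat><![CDATA[
```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one rank-r LoRA adapter:
    matrix A has shape (r, d_in) and matrix B has shape (d_out, r)."""
    return r * (d_in + d_out)


# Hypothetical example: one 4096 x 4096 projection adapted with r = 8.
per_matrix = lora_trainable_params(d_in=4096, d_out=4096, r=8)

# Hypothetical example: the same adapter applied to 2 matrices in each of 32 layers.
total = per_matrix * 2 * 32
print(per_matrix, total)
```
]]></preformat>
      <p>Two models of very different size can therefore end up with similar, or even inverted, numbers of trainable parameters if adapters are attached to different sets of layers.</p>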
      <p>Table 5 shows the exact match (EM) evaluation results of our best performing model (BLIP2-FT) by image
category. We achieved the highest EM scores of 99.02% and 97.83% for the Normal Colon and
Normal Esophagus image categories, respectively. This is due to the low variability in answers, as all the questions
related to these image categories are yes/no questions, which seem to be an easy task for a
BLIP2 model fine-tuned on similar data. Table 4 shows the only examples from these image categories
for which our model wrongly answers the yes/no questions. Moreover, Table 6 and Table 7 show that our
model achieves 96.23% and 91.94% EM scores, respectively, on questions with yes/no answers regardless of
the image category.</p>
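      <p>A per-category exact match score of this kind can be computed as sketched below. This is a minimal illustration rather than our evaluation script: the function name, the record layout, and the normalization (trimming whitespace and lowercasing before comparison) are assumptions, and the records are toy values, not results.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def exact_match_by_category(records: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """records: (image_category, predicted_answer, ground_truth_answer).
    Returns the EM score in percent for each image category."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for category, prediction, truth in records:
        totals[category] += 1
        # Assumed normalization: ignore case and surrounding whitespace.
        if prediction.strip().lower() == truth.strip().lower():
            hits[category] += 1
    return {c: 100.0 * hits[c] / totals[c] for c in totals}


# Toy records for illustration only.
toy_records = [
    ("Polyp", "yes", "yes"),
    ("Polyp", "center", "center-left"),
    ("Normal Colon", "No", "no"),
    ("Normal Colon", "yes", "yes"),
]
em_scores = exact_match_by_category(toy_records)
print(em_scores)
```
]]></preformat>
      <p>Because exact match gives no partial credit, a prediction such as "center" for the ground truth "center-left" counts as a full error, which helps explain why categories with multi-valued or location answers score lower than yes/no categories.</p>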
      <p>Our best BLIP2 model also achieved solid EM performance on the following image categories: Cecum
with 84.16%, Pylorus with 81.82%, and Esophagitis with 79.22%, indicating its relative ability to identify
specific anatomical regions and whether pathological signs are present. In contrast, our model
struggled the most on questions related to the Polyp image category, with the lowest EM score of 48.74%.
On the one hand, answering questions about polyps requires the model to consistently identify more
subtle image features; on the other hand, the Polyp image category in the dataset covers a wider
range of question types, including yes/no, color-related, counting, and location-related
questions. Similarly, the Instruments image category is also challenging for our model, which yielded
an EM score of 60.87%, as it also covers several question types that require distinguishing medical
instruments from the background. These results suggest that the model may greatly benefit from
more advanced and specific reasoning abilities, such as visual spatial reasoning, in order, for example, to
accurately answer location-related questions, for which our model achieves poor results (see Tables 6
and 7).</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper presented two simple approaches for tackling the ImageCLEFmed 2025 MedVQA challenge
using the Kvasir-VQA dataset. While the Chain-of-Thought approach offered insights into the reasoning
process behind answer generation, fine-tuning large generative models achieved significantly better
performance across all evaluation metrics. Our experiments demonstrate the effectiveness of large
vision-language models like BLIP2 when adapted to domain-specific medical tasks. Qualitative and
quantitative analyses show that endowing the model with more complex visual reasoning abilities
might improve VQA performance on the questions related to the most challenging image categories,
namely Polyp and Instruments.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, we used generative AI tools (Chat-GPT-4) only for
spell checking, paraphrasing, and LaTeX formatting purposes. After using Chat-GPT-4, we systematically
reviewed and edited all the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Additional Quantitative Results</title>
      <p>The following tables show the results of our best performing BLIP2 model by answer (or question type) on
the validation split.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Borgli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Smedsrud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Eskeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Randel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Pogorelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lux</surname>
          </string-name>
          , D. T. D.
          <string-name>
            <surname>Nguyen</surname>
          </string-name>
          , et al.,
          <article-title>HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy</article-title>
          ,
          <source>Sci. Data</source>
          <volume>7</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          . doi:10.1038/s41597-020-00622-y.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.-M. Drăgulinescu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Rückert</surname>
            ,
            <given-names>A. Ben</given-names>
          </string-name>
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
            , L. Bloch,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Brüngel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Idrissi-Yaghir</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Schäfer</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the ImageCLEF 2024: Multimedia Retrieval in Medical Applications</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , Springer, Cham, Switzerland,
          <year>2024</year>
          , pp.
          <fpage>140</fpage>
          -
          <lpage>164</lpage>
          . doi:10.1007/978-3-031-71908-0_7.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>Stanciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Andrei</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Storås</surname>
            ,
            <given-names>A. B.</given-names>
          </string-name>
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Bracke</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Lecouteux</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Macaire</surname>
            ,
            <given-names>C. M.</given-names>
          </string-name>
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C. S.</given-names>
          </string-name>
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Fabre</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Dimitrov</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Esperança-Rodier</surname>
            , G. Constantin,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Becker</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Damm</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Schäfer</surname>
            ,
            <given-names>I. Rodkin</given-names>
          </string-name>
          ,
          <string-name>
            <surname>I. Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          , L.-D. Ștefan, L. Bloch,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H. P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. J. Das</surname>
            ,
            <given-names>S. A.</given-names>
          </string-name>
          <string-name>
            <surname>Hicks</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gautam</surname>
            ,
            <given-names>T. M. G.</given-names>
          </string-name>
          <string-name>
            <surname>Pakull</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Thambawita</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>W.-W.</given-names>
          </string-name>
          <string-name>
            <surname>Yim</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF 2025: Multimedia retrieval in medical, social media and content recommendation applications</article-title>
          , in: Experimental IR Meets Multilinguality, Multimodality, and Interaction,
          <source>Proceedings of the 16th International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Springer Lecture Notes in Computer Science LNCS, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , T.-T. Do,
          <string-name>
            <given-names>B. X.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tjiputra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <article-title>Overcoming data limitation in medical visual question answering</article-title>
          , in:
          <source>Medical Image Computing and Computer Assisted Intervention-MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13-17, 2019, Proceedings, Part IV 22</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>522</fpage>
          -
          <lpage>530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.-M.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Medical visual question answering via conditional reasoning</article-title>
          ,
          <source>in: Proceedings of the 28th ACM International Conference on Multimedia</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2345</fpage>
          -
          <lpage>2354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Lau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gayen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <article-title>A dataset of clinically generated visual questions and answers about radiology images</article-title>
          ,
          <source>Scientific Data</source>
          <volume>5</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Cgmvqa: A new classification and generative model for medical visual question answering</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>50626</fpage>
          -
          <lpage>50636</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-M.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Medical visual question answering via conditional reasoning and contrastive learning</article-title>
          ,
          <source>IEEE Transactions on Medical Imaging</source>
          <volume>42</volume>
          (
          <year>2022</year>
          )
          <fpage>1532</fpage>
          -
          <lpage>1545</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Eslami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meinel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>De Melo</surname>
          </string-name>
          ,
          <article-title>Pubmedclip: How much does clip benefit visual question answering in the medical domain?</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EACL 2023</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1181</fpage>
          -
          <lpage>1193</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>PMC-VQA: Visual instruction tuning for medical visual question answering</article-title>
          ,
          <source>arXiv preprint arXiv:2305.10415</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>28541</fpage>
          -
          <lpage>28564</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>MedCoT: Medical chain of thought via hierarchical expert</article-title>
          ,
          <source>arXiv preprint arXiv:2412.13736</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>X.</given-names>
            <surname>Gai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>MedThink: Explaining medical visual question answering via multimodal decision-making rationale</article-title>
          ,
          <source>arXiv preprint arXiv:2404.12372</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Lau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gayen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <article-title>A dataset of clinically generated visual questions and answers about radiology images</article-title>
          ,
          <source>Scientific Data</source>
          <volume>5</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>PathVQA: 30000+ questions for medical visual question answering</article-title>
          ,
          <source>arXiv preprint arXiv:2003.10286</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-M.</given-names>
            <surname>Zhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-M.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering</article-title>
          ,
          <source>in: 2021 IEEE 18th international symposium on biomedical imaging (ISBI)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>1650</fpage>
          -
          <lpage>1654</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Storås</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Midoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <article-title>Kvasir-VQA: A Text-Image Pair GI Tract Dataset</article-title>
          , in: ACM Conferences, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          . doi:10.1145/3689096.3689458.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chaichuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <article-title>Prompt to Polyp: Medical Text-Conditioned Image Synthesis with Diffusion Models</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2025</year>
          ). doi:10.48550/arXiv.2505.05573.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>22170</fpage>
          -
          <lpage>22183</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tafjord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kalyan</surname>
          </string-name>
          ,
          <article-title>Learn to explain: Multimodal reasoning via thought chains for science question answering</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>2507</fpage>
          -
          <lpage>2521</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>5168</fpage>
          -
          <lpage>5191</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <article-title>Automatic differentiation in PyTorch</article-title>
          , in: NIPS 2017 Workshop on Autodiff, MIT Press, Long Beach, CA, USA,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ge</surname>
          </string-name>
          , et al.,
          <article-title>Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution</article-title>
          ,
          <source>arXiv preprint arXiv:2409.12191</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>19730</fpage>
          -
          <lpage>19742</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
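          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,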
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=nZeVKeeFYf9.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>