
Towards Better Gastrointestinal Diagnosis: Evaluating Vision-Language Models For GI VQA

Omar Adjali

Paris-Saclay University, Gif-sur-Yvette, France

Gastrointestinal (GI) image analysis is critical for early diagnosis and treatment of GI diseases, which remain a leading cause of global morbidity and mortality. Visual Question Answering (VQA) in medical imaging enables interpretable and interactive AI systems to support clinical decision-making. This paper presents our submission to the ImageCLEFmed 2025 MedVQA task, which targets medical VQA on gastrointestinal endoscopic images using the Kvasir-VQA dataset. We evaluate two primary approaches: (1) a multimodal Chain-of-Thought (CoT) reasoning framework that decomposes questions into interpretable reasoning steps, and (2) a simple fine-tuning strategy on large-scale generative models. Extensive experiments across multiple vision-language models, including Qwen2-VL and BLIP2, show that fine-tuning significantly outperforms CoT in both validation and test settings. Our best-performing model achieves a METEOR score of 50 on the test set. We also carry out qualitative and quantitative analyses to expose the strengths and weaknesses of our best-performing approach, and draw from them some insights for tackling the most challenging aspects of the Kvasir-VQA task.

Keywords: Medical VQA, ImageCLEFmed 2025, Multimodal AI, Clinical Question Answering, Synthetic GI Images

1. Introduction

2. Related Work

2.1. MedVQA Approaches

MedVQA has gained significant attention as a critical task in biomedical AI, requiring models to generate accurate textual answers conditioned on medical images. Early MedVQA approaches addressed tasks with limited annotated data. For example, [4] proposed a framework combining Convolutional Denoising Auto-Encoders (CDAE) and Model-Agnostic Meta-Learning (MAML) to exploit both unlabeled data and few-shot learning. [5] further introduced a conditional reasoning approach that adapts reasoning strategies to the question type (e.g., closed- vs. open-ended), which significantly improved performance on the VQA-RAD dataset [6]. To manage the diversity of question types, [7] proposed CGMVQA, a hybrid model that handles both classification and generative answering with a transformer-based architecture. Further works have employed contrastive learning for better visual feature extraction in low-data regimes. In particular, [8] proposed a dual approach combining a reasoning module with a contrastively trained visual encoder. Similarly, [9] fine-tuned CLIP on PubMed image-text pairs to obtain the PubMedCLIP encoder, showing notable improvements over visual-only pretrained models. More recently, the generative paradigm has gained interest with the emergence of Large Vision-Language Models (LVLMs). [10] introduced PMC-VQA, a large-scale dataset comprising over 227k image-question-answer pairs, and proposed a generative model fine-tuned for free-form answering. Similarly, [11] presented LLaVA-Med, trained using a novel curriculum learning strategy on instruction-following data generated by GPT-4, outperforming previous supervised approaches in both accuracy and versatility. To improve interpretability, which is crucial for clinical applications, recent work has leveraged the self-reflection reasoning enabled by large language models (LLMs). For example, [12] proposed MedCoT, which relies on multi-expert diagnostic collaboration through hierarchical Chain-of-Thought and Mixture of Experts. [13] introduced MedThink, which integrates Medical Decision-Making Rationales (MDMRs) into a generative model to make the reasoning process transparent and clinically verifiable.

2.2. MedVQA Datasets

The development of robust and clinically relevant Visual Question Answering (VQA) systems for medicine is heavily dependent on high-quality annotated datasets. Over the past few years, several notable datasets have emerged, each addressing unique aspects of medical image understanding through natural language queries. VQA-RAD [14] is the first manually curated medical VQA dataset tailored to radiology. It comprises over 3,500 natural question-answer (QA) pairs covering 315 unique radiological images. The questions were authored by clinical trainees with medical imaging experience, ensuring medical realism. Similarly, [15] introduced PathVQA, the first VQA dataset focused on pathology, including open-ended and yes/no questions. More recently, [16] proposed SLAKE, a large bilingual dataset that covers more body parts with rich semantic labels annotated by experienced physicians. In the context of the ImageCLEFmed 2025 MedVQA challenge, [17] proposed the Kvasir-VQA dataset, which extends the HyperKvasir and Kvasir-Instrument datasets by introducing over 52,000 synthetic question-answer pairs for 6,500 images across various gastrointestinal findings, including polyps, esophagitis, and ulcerative colitis. These QA pairs encompass a range of formats such as yes/no, multiple choice, location, and numerical count, and were validated by medical experts. This dataset targets image captioning, diagnostic VQA, and synthetic image generation, enabling research in GI tract diagnostics and fine-grained instrument recognition. It also supports training generative models, such as the one in [18], for image synthesis based on medical prompts. Finally, and most recently, [19] proposed OmniMedVQA, a new large-scale comprehensive benchmark for evaluating large vision-language models in the medical domain. It comprises 118,010 real medical images and 127,995 question-answer (QA) pairs, collected from 73 distinct datasets, spanning 12 imaging modalities (e.g., MRI, CT, X-Ray, Ultrasound) and over 20 human anatomical regions. OmniMedVQA QA pairs are systematically constructed to evaluate five major medical reasoning capabilities: modality recognition, anatomy identification, disease diagnosis, lesion grading, and biological attributes.

3. Task Overview and Dataset

3.1. Task Formulation

The Medical Visual Question Answering (MedVQA) task aims to develop models that can accurately answer clinically relevant questions about gastrointestinal (GI) endoscopic images. Leveraging the Kvasir-VQA dataset, the task combines computer vision and natural language understanding to simulate expert-level diagnostic reasoning. Formally, given an input medical image and a natural language question associated with the image, the objective is to map the image-question pair to a natural language answer that is accurate and contextually grounded in the image.

3.2. Kvasir-VQA Dataset

4. Methodology

In this paper, we explore two approaches to tackle the ImageCLEFmed 2025 MedVQA challenge (task 1). We first investigate how a multimodal Chain-of-Thought (CoT) system performs on the Kvasir-VQA dataset. Then, we evaluate a simple fine-tuning strategy using the Kvasir-VQA training set and other medical training data.

4.1. Multimodal CoT

Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to explicitly decompose complex questions into intermediate reasoning steps [20, 21]. As shown in [12], MedVQA queries often require multi-step inference that combines clinical knowledge with image interpretation, so logical paths can be traced from the question to the final answer. Such a structured decomposition may help mitigate hallucinations and improve answer generation accuracy. Inspired by [12], we model the ImageCLEFmed 2025 MedVQA task using a multimodal CoT system. Given a Large Vision Language Model (LVLM) and a question-image input pair (q, i), we perform the following inference steps. First, we generate a preliminary reasoning rationale R such that R = LVLM(q, i, P), where P is the prompt instruction used to generate the rationale R. P is formulated as follows:

Rationale Instruction Prompt

You are a Vision Language Model assistant which helps an experienced doctor interpreting accurately interpreting and answering clinical questions based on gastrointestinal images. Given the image, provide a reasonable rationale for the question: {QUESTION}. Please proceed with a step-by-step analysis and provide a rationale.

Subsequently, R is used to generate the final answer A, such that: A = LVLM(q, i, R). We relied on the following prompt:

Answer Generation Prompt

You are a Vision Language Model assistant which helps an experienced doctor interpreting accurately interpreting and answering clinical questions based on gastrointestinal images. Given the image and the rationale: {RATIONALE}, Answer briefly the question: {QUESTION} with no extra text, rationales or explanation.
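The two inference steps above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the `LVLM` callable interface, the `cot_answer` helper and the `stub_lvlm` stand-in are hypothetical names introduced here, and only the prompt wording is taken from the text.

```python
from typing import Callable, Tuple

# Hypothetical interface: any callable mapping (prompt, image bytes) to text.
LVLM = Callable[[str, bytes], str]

# Prompt wording copied verbatim from the two prompts quoted above.
ROLE = (
    "You are a Vision Language Model assistant which helps an experienced "
    "doctor interpreting accurately interpreting and answering clinical "
    "questions based on gastrointestinal images. "
)
RATIONALE_TEMPLATE = ROLE + (
    "Given the image, provide a reasonable rationale for the question: "
    "{QUESTION}. Please proceed with a step-by-step analysis and provide "
    "a rationale."
)
ANSWER_TEMPLATE = ROLE + (
    "Given the image and the rationale: {RATIONALE}, Answer briefly the "
    "question: {QUESTION} with no extra text, rationales or explanation."
)


def cot_answer(lvlm: LVLM, question: str, image: bytes) -> Tuple[str, str]:
    """Run the two-stage inference and return (rationale, answer)."""
    # Stage 1: R = LVLM(q, i, P)
    rationale = lvlm(RATIONALE_TEMPLATE.format(QUESTION=question), image)
    # Stage 2: A = LVLM(q, i, R)
    answer_prompt = ANSWER_TEMPLATE.format(RATIONALE=rationale, QUESTION=question)
    answer = lvlm(answer_prompt, image)
    return rationale, answer.strip()


def stub_lvlm(prompt: str, image: bytes) -> str:
    """Stand-in model so the sketch runs without any weights."""
    if "Answer briefly" in prompt:
        return " yes\n"
    return "A raised lesion is visible on the mucosa."
```

Calling `cot_answer(stub_lvlm, "Is there a polyp?", b"")` exercises both stages. In practice the stub would be replaced by a wrapper around a real vision-language model.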

Since the generated rationale can be ineffective with regard to the ground truth answer $A^{*}$, we trained the LVLM to answer Kvasir-VQA questions in order to reduce discrepancies between rationales and the actual answers. The LVLM is trained with the following cross-entropy loss:

$$\mathcal{L}_{gen} = -\sum_{n=1}^{N} \log p\left(A_n^{*} \mid q_n, i_n, R_n\right) \tag{1}$$

where $A_n^{*}$ is the ground truth answer, $q_n$ is the question, $i_n$ is the image, and $R_n$ is the rationale of the $n$-th of $N$ training examples.

4.2. Finetuning strategy

In the second approach, we performed answer generation using a generative model denoted G(·) with parameters Φ. Given a question $q$ and the associated image $i$, the answer generator G is trained with the following cross-entropy loss over a batch of $N$ question-image pairs:

$$\mathcal{L}_{G} = -\sum_{n=1}^{N} \log p_{\Phi}\left(A_n^{*} \mid q_n, i_n\right) \tag{2}$$

where $q_n$ is the $n$-th question, $i_n$ is the image associated with $q_n$, $A_n^{*}$ is the ground truth answer string for $(q_n, i_n)$, $p_{\Phi}(A_n^{*} \mid q_n, i_n)$ is the probability of generating the correct answer from the text-image pair, and Φ denotes the parameters of the multimodal answer generator.
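As a rough illustration of the cross-entropy loss above, the sketch below computes the negative log-likelihood of a batch from the probabilities a model assigns to each ground truth answer token. It assumes the usual autoregressive factorization, in which the log-probability of an answer string is the sum of its token log-probabilities. The function names are introduced here for illustration, and training frameworks often average over tokens rather than sum.

```python
import math
from typing import Sequence


def sequence_log_prob(token_probs: Sequence[float]) -> float:
    """Log-probability of one ground truth answer, given the probability
    the model assigns to each of its tokens (assumed factorization)."""
    return sum(math.log(p) for p in token_probs)


def generation_loss(batch_token_probs: Sequence[Sequence[float]]) -> float:
    """Negative log-likelihood summed over a batch of answers."""
    return -sum(sequence_log_prob(tp) for tp in batch_token_probs)
```

For instance, `generation_loss([[0.5, 0.5], [0.25]])` sums the log-probabilities of a two-token answer and a one-token answer. The loss shrinks toward zero as the model puts more probability mass on the ground truth tokens.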

4.3. Training details

We implemented all our experiments in PyTorch [22] and relied on the Qwen2-VL-72B-instruct [23] LVLM to generate reasoning rationales. We then performed the CoT and fine-tuning training stages on LVLMs of different sizes and architectures: Qwen2-VL-7B-instruct, Qwen2-VL-32B-instruct [23] and BLIP2-Flan-T5-XL [24]. We trained for 10 epochs using a batch size of 4 and a learning rate of 2e-5 on a single A100 GPU. Throughout all fine-tuning experiments, the LVLMs are trained using LoRA [25] for parameter-efficient optimization with the following configurations: {r = 8, lora_alpha = 32, lora_dropout = 0.1} for BLIP2-Flan-T5-XL and {r = 8, lora_alpha = 16, lora_dropout = 0.05} for the Qwen models. At inference time, decoding is performed using beam search with 3 beams. Model checkpoints were selected based on validation METEOR performance.
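To make the LoRA settings more concrete, the sketch below shows how such a configuration determines the update scaling and the number of trainable parameters. It assumes the three values in each configuration are the usual LoRA rank, scaling factor alpha and dropout. The layer dimensions and the number of adapted matrices in the usage note are hypothetical and are not meant to reproduce the trainable-parameter counts reported in this paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    r: int               # rank of the low-rank update
    lora_alpha: int      # scaling numerator
    lora_dropout: float  # dropout applied on the LoRA branch

    @property
    def scaling(self) -> float:
        # LoRA scales the low-rank update B @ A by alpha / r.
        return self.lora_alpha / self.r


def lora_trainable_params(settings: LoraSettings, d_in: int, d_out: int,
                          n_adapted_matrices: int) -> int:
    """Each adapted weight of shape (d_out, d_in) adds A (r x d_in) and
    B (d_out x r), i.e. r * (d_in + d_out) trainable parameters."""
    return n_adapted_matrices * settings.r * (d_in + d_out)


# Values read from the training details above (assumed field meanings).
BLIP2_SETTINGS = LoraSettings(r=8, lora_alpha=32, lora_dropout=0.1)
QWEN_SETTINGS = LoraSettings(r=8, lora_alpha=16, lora_dropout=0.05)
```

With a hypothetical model in which 48 square projection matrices of width 2048 are adapted, `lora_trainable_params(BLIP2_SETTINGS, 2048, 2048, 48)` gives the adapter size. The count depends on which modules are adapted, which the text does not specify.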

5. Results and Discussion

We evaluated both Chain-of-Thought (CoT) and fine-tuned (FT) models using BLEU, ROUGE, and METEOR scores. The Qwen2-VL and BLIP2 model architectures were evaluated with both methodologies. We additionally assessed our best-performing model using the Exact Match (EM) metric to support the qualitative and quantitative analysis.

Example questions and answers:
Question: Where in the image is the abnormality? Answer: center; center-left; center-right
Question: What color is the anatomical landmark? If more than one separate with ; Answer: red; pink; white
Question: What is the size of the polyp? Answer: 5–10 mm
Question: Are there any abnormalities in the image? Check all that are present. Answer: ulcerative colitis
Question: What type of procedure is the image taken from? Answer: gastroscopy

Table 2 shows that the BLIP2 model fine-tuned on the Kvasir-VQA dataset achieved the best overall performance on the Kvasir-VQA validation set. Note that, to achieve the best performance on the test set, we further fine-tuned the BLIP2 model on the training sets of the PathVQA [15], VQA-RAD [14], SLAKE [16] and OmniMedVQA [19] datasets, which yielded a METEOR score of 50 and a BLEU score of 22. In contrast, while Chain-of-Thought prompting generally enhances interpretability by providing intermediate reasoning, its practical effectiveness on the Kvasir-VQA dataset seems limited without additional rationale supervision. We believe that instruction fine-tuning the Qwen2-VL-72B-instruct model, which we used to generate the reasoning rationales, on medical-domain data would help it produce more comprehensive and less noisy rationales, and thus alleviate the answer/rationale discrepancies.

Furthermore, the performance gap between BLIP2-Flan-T5-XL and the Qwen2-VL models is worth noting. BLIP2-Flan-T5-XL consistently outperforms Qwen2-VL-7B-instruct regardless of the training method and performs comparably to Qwen2-VL-32B-instruct in the CoT setting despite the difference in model size. Moreover, under our LoRA configuration, which reduces the number of trainable parameters, BLIP2-Flan-T5-XL has 4.7M, Qwen2-VL-7B has 2.5M and Qwen2-VL-32B has 8.3M trainable parameters. BLIP2-Flan-T5-XL thus shows superior capabilities on the Kvasir-VQA task despite its relatively small size. We believe that encoder-decoder architectures such as BLIP2 are more suitable for VQA tasks as they encode rich image features before generating the textual output, facilitating better multimodal alignment, while decoder-only models like Qwen2-VL must process the image and question together through a single stream, which may limit fine-grained control over visual and textual token interactions during generation.

Table 5 shows the exact match (EM) evaluation results of our best-performing model (BLIP2-FT) by image category. We achieved the highest EM scores, 99.02% and 97.83%, for the Normal Colon and Normal Esophagus image categories, respectively. This is due to the low variability in answers, as all the questions related to these image categories are yes/no questions, which appear to be an easy task for a BLIP2 model fine-tuned on similar data. Table 4 lists the only examples from these image categories where our model answers the yes/no questions incorrectly. Moreover, Table 6 and Table 7 show that our model achieves EM scores of 96.23% and 91.94%, respectively, on questions with yes/no answers regardless of the image category.
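A per-category exact match score of the kind discussed here can be computed as in the following sketch. The normalization (lowercasing and whitespace collapsing) is an assumption, since the text does not state how predictions and references were compared, and the records in the usage note are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def exact_match_by_category(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, float]:
    """records holds (image category, prediction, reference) triples.
    Returns the EM score per category as a percentage."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for category, prediction, reference in records:
        totals[category] += 1
        hits[category] += int(normalize(prediction) == normalize(reference))
    return {c: 100.0 * hits[c] / totals[c] for c in totals}
```

For example, feeding it `[("Polyp", "Yes", "yes"), ("Polyp", "center", "center-left"), ("Cecum", " no ", "no")]` scores one of the two Polyp answers and the single Cecum answer as matches. Grouping by question type instead of image category only requires changing the first field of each record.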

Our best BLIP2 model also achieved solid EM performance on the following image categories: Cecum with 84.16%, Pylorus with 81.82%, and Esophagitis with 79.22%, indicating a relative ability to identify specific anatomical regions and to detect whether pathological signs are present. In contrast, our model struggled the most on questions related to the Polyp image category, with the lowest EM score of 48.74%. On the one hand, answering questions about polyps requires the model to consistently identify more subtle image features; on the other hand, the Polyp image category covers a wider range of question types, including yes/no, color-related, counting and location-related questions. Similarly, the Instruments image category is challenging for our model, which yielded an EM score of 60.87%, as it also covers several question types that require distinguishing medical instruments from the background. These results suggest that the model may greatly benefit from more advanced and specific reasoning abilities, such as visual spatial reasoning, for example to accurately answer location-related questions, on which our model achieves poor results (see Tables 6 and 7).

6. Conclusion

This paper presented two simple approaches for tackling the ImageCLEFmed 2025 MedVQA challenge using the Kvasir-VQA dataset. While the Chain-of-Thought approach offered insights into the reasoning process behind answer generation, fine-tuning large generative models achieved significantly better performance across all evaluation metrics. Our experiments demonstrate the effectiveness of large vision-language models like BLIP2 when adapted to domain-specific medical tasks. Qualitative and quantitative analyses show that endowing the model with more complex visual reasoning abilities might improve VQA performance on questions related to the most challenging image categories, namely Polyp and Instruments.

Declaration on Generative AI

During the preparation of this work, we used generative AI tools (Chat-GPT-4) only for spell checking, paraphrasing, and LaTeX formatting. After using Chat-GPT-4, we systematically reviewed and edited all the content as needed and take full responsibility for the publication’s content.

A. Additional Quantitative Results

The following tables show the results of our best-performing BLIP2 model by answer (or question type) on the validation split.

[1] Borgli, Thambawita, P. H. Smedsrud, Hicks, Jha, S. L. Eskeland, K. R. Randel, Pogorelov, Lux, D. T. D. Nguyen, et al., HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy, Sci. Data 7 (2020) 1-14. doi:10.1038/s41597-020-00622-y.

[2] Ionescu, Müller, A.-M. Drăgulinescu, J. Rückert, A. Ben Abacha, A. García Seco de Herrera, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, et al., Overview of the ImageCLEF 2024: Multimedia Retrieval in Medical Applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer, Cham, Switzerland, 2024, pp. 140-164. doi:10.1007/978-3-031-71908-0_7.

[3] Ionescu, Müller, D.-C. Stanciu, Idrissi-Yaghir, Radzhabov, A. G. S. de Herrera, A. Andrei, A. Storås, A. B. Abacha, B. Bracke, B. Lecouteux, B. Stein, C. Macaire, C. M. Friedrich, C. S. Schmidt, D. Fabre, D. Schwab, D. Dimitrov, E. Esperança-Rodier, G. Constantin, H. Becker, H. Damm, H. Schäfer, I. Rodkin, I. Koychev, Kiesel, Rückert, Malvehy, L.-D. Ștefan, L. Bloch, Potthast, Heinrich, M. A. Riegler, Dogariu, Codella, P. H. P. Nakov, Brüngel, R. A. Novoa, R. J. Das, S. A. Hicks, S. Gautam, T. M. G. Pakull, V. Thambawita, V. Kovalev, W.-W. Yim, Z. Xie, Overview of ImageCLEF 2025: Multimedia retrieval in medical, social media and content recommendation applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025), Springer Lecture Notes in Computer Science LNCS, Madrid, Spain, 2025.

[4] B. D. Nguyen, T.-T. Do, B. X. Nguyen, Do, Tjiputra, Q. D. Tran, Overcoming data limitation in medical visual question answering, in: Medical Image Computing and Computer Assisted Intervention-MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13-17, 2019, Proceedings, Part IV 22, Springer, 2019, pp. 522-530.

[5] L.-M. Zhan, Liu, Fan, Chen, X.-M. Wu, Medical visual question answering via conditional reasoning, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2345-2354.

[6] J. J. Lau, Gayen, A. Ben Abacha, Demner-Fushman, A dataset of clinically generated visual questions and answers about radiology images, Scientific data 5 (2018) 1-10.

[7] Ren, Zhou, Cgmvqa: A new classification and generative model for medical visual question answering, IEEE Access 8 (2020) 50626-50636.

[8] Liu, L.-M. Zhan, L. Xu, X.-M. Wu, Medical visual question answering via conditional reasoning and contrastive learning, IEEE transactions on medical imaging 42 (2022) 1532-1545.

[9] Eslami, Meinel, G. De Melo, Pubmedclip: How much does clip benefit visual question answering in the medical domain?, in: Findings of the Association for Computational Linguistics: EACL 2023, 2023, pp. 1181-1193.

[10] Zhang, Wu, Zhao, Lin, Zhang, Wang, Xie, Pmc-vqa: Visual instruction tuning for medical visual question answering, arXiv preprint arXiv:2305.10415 (2023).

[11] Li, Wong, Zhang, Usuyama, H. Liu, Yang, Naumann, Poon, Gao, Llava-med: Training a large language-and-vision assistant for biomedicine in one day, Advances in Neural Information Processing Systems 36 (2023) 28541-28564.

[12] Liu, Wang, Du, J. T. Zhou, Liu, Medcot: Medical chain of thought via hierarchical expert, arXiv preprint arXiv:2412.13736 (2024).

[13] Gai, Zhou, J. Liu, Feng, Wu, Liu, Medthink: Explaining medical visual question answering via multimodal decision-making rationale, arXiv preprint arXiv:2404.12372 (2024).

[14] J. J. Lau, Gayen, A. Ben Abacha, Demner-Fushman, A dataset of clinically generated visual questions and answers about radiology images, Scientific data 5 (2018) 1-10.

[15] He, Zhang, Mou, E. Xing, Xie, Pathvqa: 30000+ questions for medical visual question answering, arXiv preprint arXiv:2003.10286 (2020).

[16] Liu, L.-M. Zhan, L. Xu, L. Ma, Y. Yang, X.-M. Wu, Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering, in: 2021 IEEE 18th international symposium on biomedical imaging (ISBI), IEEE, 2021, pp. 1650-1654.

[17] Gautam, A. M. Storås, Midoglu, S. A. Hicks, Thambawita, Halvorsen, M. A. Riegler, Kvasir-VQA: A Text-Image Pair GI Tract Dataset, in: ACM Conferences, Association for Computing Machinery, New York, NY, USA, 2024, pp. 3-12. doi:10.1145/3689096.3689458.

[18] Chaichuk, Gautam, Hicks, E. Tutubalina, Prompt to Polyp: Medical Text-Conditioned Image Synthesis with Diffusion Models, arXiv (2025). doi:10.48550/arXiv.2505.05573.

[19] Hu, Li, Lu, Shao, He, Qiao, Luo, Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22170-22183.

[20] Lu, Mishra, Xia, Qiu, K.-W. Chang, S.-C. Zhu, Tafjord, Clark, Kalyan, Learn to explain: Multimodal reasoning via thought chains for science question answering, Advances in Neural Information Processing Systems 35 (2022) 2507-2521.

[21] Zheng, Yang, Tang, H.-Y. Zhou, S. Yang, Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models, Advances in Neural Information Processing Systems 36 (2023) 5168-5191.

[22] Paszke, Gross, Chintala, Chanan, Yang, DeVito, Lin, Desmaison, Antiga, Lerer, Automatic differentiation in pytorch, in: NIPS 2017 Workshop on Autodiff, MIT Press, Long Beach, CA, USA, 2017.

[23] Wang, Bai, Tan, Wang, Fan, Bai, Chen, Liu, Wang, Ge, et al., Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution, arXiv preprint arXiv:2409.12191 (2024).

[24] Li, Savarese, Hoi, Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, in: International conference on machine learning, PMLR, 2023, pp. 19730-19742.

[25] E. J. Hu, Y. Shen, P. Wallis, Allen-Zhu, Li, Wang, Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: International Conference on Learning Representations, 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9.