<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Overview of ImageCLEFmedical 2025 - Visual Question Answering and Synthetic Image Generation for Gastrointestinal Tract</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sushant Gautam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vajira Thambawita</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Riegler</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pål Halvorsen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steven Hicks</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>OsloMet - Oslo Metropolitan University</institution>
          ,
          <addr-line>Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Simula Research Laboratory</institution>
          ,
          <addr-line>Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>SimulaMet - Simula Metropolitan Center for Digital Engineering</institution>
          ,
          <addr-line>Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper provides an overview of the third edition of the Medical Visual Question Answering for the Gastrointestinal Tract (MedVQA-GI) challenge, hosted at ImageCLEF 2025. Building on the experiences gained from the last two editions, this year's challenge presented two tasks: (1) Visual Question Answering (VQA) over gastrointestinal (GI) images and (2) high-fidelity synthetic image generation for GI data. Participants were asked to develop multimodal models capable of answering clinical questions based on annotated images and to generate synthetic GI images using text prompts. The dataset was extended from previous years and provided a wide variety of GI images with annotations. Submissions were evaluated using a mix of text generation metrics and image realism metrics. Participation increased slightly from last year, but completion rates remained a challenge. This paper details the tasks, data, evaluation methods, and results. The competition repository is at: github.com/simula/ImageCLEFmed-MEDVQA-GI-2025.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual question answering</kwd>
        <kwd>Synthetic medical images</kwd>
        <kwd>Endoscopy</kwd>
        <kwd>Machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The third edition of the Medical Visual Question Answering for the Gastrointestinal Tract (MedVQA-GI)
challenge at ImageCLEF continued our focus on advanced image-based machine learning for
gastrointestinal (GI) diagnostics. This year we expanded the challenge to include both question answering and
text-to-image synthesis tasks. These additions aim to better simulate real-world diagnostic settings by
incorporating both image interpretation and generation capabilities into AI systems.</p>
      <p>
        Machine learning has long been applied to support lesion detection in gastrointestinal (GI) images [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11</xref>
        ]. Historically, most efforts have centered on identifying abnormalities such
as polyps in image or video data [12, 13, 14, 15, 16, 17, 18], and multiple shared tasks have advanced
the field through organized benchmarking [19, 20, 21, 22, 23]. More recently, there has been growing
attention on the use of generative models to create synthetic GI images [24, 25]. These images serve
as privacy-preserving alternatives to real data and can be useful for model development, clinician
training, and system evaluation. To reflect these trends, this year’s MedVQA-GI challenge incorporates
both diagnostic reasoning via VQA and synthetic image generation. All data and supporting code are
available in our public repository (github.com/simula/ImageCLEFmed-MEDVQA-GI-2025).
      </p>
      <p>The remainder of this paper is organized as follows. First, we describe the dataset creation and
structure. Then, we present the two challenge tasks along with the evaluation methodology. Finally, we
discuss the submissions, results, and lessons learned from this year.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>The dataset used in this challenge builds upon the publicly available HyperKvasir [26] and
KvasirVQA [27] datasets. These datasets consist of gastrointestinal (GI) endoscopy images covering a wide
range of anatomical sites and pathological findings, making them suitable for multimodal tasks such
as visual question answering (VQA) and image captioning. Examples from the dataset can be seen in
Figure 1.</p>
      <p>For subtask 1, the development dataset was based on Kvasir-VQA, which contains over 6,500 GI
images annotated with visual questions and corresponding answers. The questions span multiple
types—Yes/No, Single-Choice, Multiple-Choice, Color, Location, and Count—designed to evaluate a
model’s capabilities in classification, reasoning, spatial localization, and attribute recognition. Each
image was annotated with one or more questions to ensure multimodal diversity and support a range
of inference challenges. The dataset reflects clinically relevant scenarios, helping models generalize
to real-world diagnostic tasks. The test dataset for subtask 1 was drawn from a custom, unreleased
set of GI images. These were sampled from a combination of different sources not included in the
development set, ensuring that the test data was distributionally distinct and unseen. This was done to
better evaluate the generalization performance of participating systems under realistic conditions.</p>
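      <p>For illustration, a single Subtask 1 development example pairs an image with a question and a short free-text answer. A record might look as in the sketch below; the field names are hypothetical and do not necessarily match the schema of the released files.</p>
      <preformat>
# A hypothetical Kvasir-VQA-style development record (illustrative field names only).
example_record = {
    "image_id": "example_0001",                       # identifier of the GI image
    "question": "How many polyps are in the image?",  # one of the six question types
    "question_type": "Numerical Count",
    "answer": "2",                                    # free-text answer to be generated
}
      </preformat>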
      <p>For subtask 2, participants were provided with a set of over 2,000 image-caption pairs. These were
curated to reflect clinically meaningful descriptions of GI endoscopy images, with captions written to
summarize findings such as anatomical features, abnormalities, or procedural contexts. To supplement
the limited size of the manually annotated caption dataset, a set of additional synthetic captions was
released. These were generated using large language models and rule-based methods to provide a
diverse range of phrasings and improve the effectiveness of model fine-tuning. The synthetic data
aimed to introduce variation and reduce overfitting on the manually annotated samples. As with the
VQA task, the captioning test set was drawn from a secret, mixed-source dataset that was distinct from
the development data.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Tasks and Evaluation</title>
      <p>This year’s MedVQA-GI consists of two subtasks: answering clinical questions about GI images and
generating synthetic GI images from text prompts.</p>
      <sec id="sec-3-1">
        <title>3.1. Subtask 1: Question Interpretation and Response</title>
        <p>This subtask requires participants to submit ML models capable of answering questions based on
gastrointestinal (GI) images from the Kvasir-VQA dataset [27]. The dataset consists of 6,500 annotated
images representing a range of anatomical sites, pathological conditions, and endoscopic tools. Each
image is paired with a clinical question that falls into one of six categories: Yes/No, Single-Choice,
Multiple-Choice, Color-Related, Location-Related, and Numerical Count. These categories require
models that can handle both fine-grained visual recognition and contextual understanding of medical
language. Questions may require identifying instruments, estimating quantities (like number of polyps),
recognizing colors (like bleeding or bile), or locating anatomical features. Model performance is assessed
using standard natural language generation metrics:
METEOR [28] Evaluates text generation by aligning predicted and reference outputs based on exact,
stem, synonym, and paraphrase matches.</p>
        <p>ROUGE (1/2/L) [29] A set of metrics for comparing overlapping n-grams between generated and
reference texts. ROUGE-1 and ROUGE-2 measure unigram and bigram overlap, respectively,
while ROUGE-L captures the longest common subsequence.</p>
        <p>BLEU [30] Measures n-gram precision between generated and reference texts, with a brevity penalty
to penalize overly short outputs.</p>
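        <p>As a concrete illustration (and not the official challenge scorer), these metrics can be computed with the Hugging Face evaluate package as in the sketch below; the answer strings are made-up examples.</p>
        <preformat>
# Minimal sketch: scoring generated answers with BLEU, ROUGE, and METEOR
# via the Hugging Face `evaluate` package. Strings are illustrative only.
import evaluate

predictions = ["there are two polyps", "no"]
references = ["there are 2 polyps", "no"]  # one reference answer per prediction

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print(bleu.compute(predictions=predictions, references=references))    # corpus BLEU
print(rouge.compute(predictions=predictions, references=references))   # ROUGE-1/2/L
print(meteor.compute(predictions=predictions, references=references))  # METEOR
        </preformat>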
        <p>All models were submitted and validated through a Hugging Face-hosted repository. This setup
ensured fair comparison and reproducibility, and allowed participants to view their standing on a public
leaderboard.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Subtask 2: Synthetic Image Generation</title>
        <p>This subtask involves generating synthetic gastrointestinal (GI) images from structured clinical prompts,
aiming to mimic the visual and diagnostic complexity of real-world endoscopic imagery. These synthetic
outputs are intended to support AI development by enhancing data availability while minimizing
dependence on sensitive patient data. Prompts provided to participants detailed anatomical sites,
pathological cues, and procedural contexts. The challenge was to synthesize images that closely align
with these clinical descriptions while maintaining variability and realism.</p>
        <p>To assess model performance, we employed both automatic and expert-driven evaluations. Automated
assessment was conducted using four quantitative metrics designed specifically for the medical imaging
domain [31]. Each metric was computed using BiomedCLIP [32] image embeddings to ensure clinical
relevance:
Fidelity Quantifies visual realism by comparing each generated image to its real counterpart. It is
defined as:</p>
        <p>Fidelity = 1000 / (1 + mean-FID(G_p, R_p))
where G_p and R_p denote the BiomedCLIP features of the generated and real images for prompt p. A
higher score reflects closer alignment with real images.</p>
        <p>Agreement Measures semantic and visual consistency between images produced from original
prompts and their reworded variants. Computed as the mean cosine similarity:</p>
        <p>Agreement = (1/P) Σ_p [ 1/(|A_p| |B_p|) Σ_{u∈A_p, v∈B_p} (u · v)/(‖u‖ ‖v‖) ]
where A_p and B_p are the BiomedCLIP embedding sets of images generated from the original and
rephrased versions of prompt p, respectively.</p>
        <p>Diversity Captures the intra-prompt variability of the generated images:</p>
        <p>Diversity = (1/P) Σ_p pdist(E_p)
with E_p representing the normalized embedding set per prompt and pdist indicating the average
pairwise distance function.</p>
        <p>Fréchet BiomedCLIP Distance (FBD) Evaluates global distributional alignment between the full
sets of synthetic and real images using the Fréchet distance:</p>
        <p>FBD = ‖μ_gen − μ_real‖² + Tr(Σ_gen + Σ_real − 2(Σ_gen Σ_real)^{1/2})
where μ and Σ refer to the mean and covariance of BiomedCLIP features. Lower FBD values
indicate better overall realism.</p>
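        <p>For illustration, the FBD term can be computed directly from precomputed BiomedCLIP embeddings, as in the minimal sketch below; this is an illustrative re-implementation under stated assumptions, not the official evaluation script.</p>
        <preformat>
# Minimal sketch of a Frechet-distance computation over BiomedCLIP embeddings.
# `real` and `gen` are assumed to be NumPy arrays with one embedding per row.
import numpy as np
from scipy import linalg

def frechet_biomedclip_distance(real, gen):
    mu_r, mu_g = real.mean(axis=0), gen.mean(axis=0)
    sigma_r = np.cov(real, rowvar=False)
    sigma_g = np.cov(gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_g @ sigma_r, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts from numerical error
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(sigma_g + sigma_r - 2.0 * covmean))
        </preformat>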
        <p>In addition to these automated measures, expert raters assessed the clinical plausibility and diagnostic
utility of the generated outputs. Like Subtask 1, all submissions were hosted on our public Hugging
Face repository.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Participation</title>
      <p>In total, 45 teams registered for Subtask 1 and 44 for Subtask 2, representing an increase in registrations
compared to last year. Of these, 5 teams submitted runs, and 5 teams submitted working notes papers [33,
34, 35, 36, 37]. Table 1 shows an overview of the participants and the number of submissions to
each sub-task, alongside the number of participants from last year’s challenge. As in previous years, we
observed that many who registered did not submit, which is a common pattern. However, we also saw an
increase in the number of actual submissions compared to last year. This suggests growing engagement
among those who proceed past registration. Future editions could still benefit from improved outreach
and support, such as tutorials or "getting started" scripts, to make it easier to participate.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Five teams submitted runs to Subtask 1, while three teams participated in Subtask 2. Below, we briefly
describe each team’s submission and their approach. The overall results are presented in Table 2 for
Subtask 1 and Table 3 for Subtask 2.</p>
      <sec id="sec-5-1">
        <title>5.1. Team Sagarmatha Rangers</title>
        <p>Team Sagarmatha Rangers [33] participated in Subtask 1 and used Florence-2 as their base model,
fine-tuned on the challenge development dataset. They incorporated domain-specific image augmentations
like flipping, jitter, and cropping, and embedded location tokens in the prompt to enhance spatial
understanding. Training was conducted using LoRA, and evaluation showed that augmentations
improved performance.</p>
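        <p>For orientation, the sketch below shows how a Florence-2 checkpoint can be wrapped with LoRA adapters using the transformers and peft libraries; the checkpoint name and hyperparameters are illustrative assumptions, not necessarily the exact configuration used by the team.</p>
        <preformat>
# Illustrative LoRA setup for a Florence-2-style model (assumed checkpoint and settings).
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "microsoft/Florence-2-base-ft"  # assumed public checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

lora_config = LoraConfig(
    r=16,                         # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules="all-linear",  # attach adapters to all linear layers
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
        </preformat>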
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Team CS_Morgan Lab</title>
        <p>Team CS_Morgan Lab [34] participated in both Subtask 1 and Subtask 2. For Subtask 1, they fine-tuned
a BLIP2-Flan-T5 model with a ViT-G encoder using a causal language modeling approach. Training was
done for 3 epochs on an A100 GPU with AdamW and a batch size of 32. They added post-processing to
normalize answer outputs. Compared with zero-shot models such as MiniGPT-4 and LLaVA-1.5,
their fine-tuned BLIP2 model achieved superior scores across BLEU, ROUGE, and METEOR metrics. For
Subtask 2, they used Stable Diffusion v1.5 fine-tuned with LoRA on image-caption pairs from the GI
domain. The model was trained using DreamBooth techniques to improve prompt-image alignment.</p>
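        <p>As a rough illustration of the generation side, the sketch below loads LoRA weights into a Stable Diffusion pipeline with the diffusers library and samples an image from a clinical prompt; the base checkpoint, adapter path, and prompt are placeholders rather than the artifacts actually used by the team.</p>
        <preformat>
# Illustrative prompt-conditioned generation with a LoRA-adapted Stable Diffusion model.
# Checkpoint, adapter path, and prompt are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/gi-lora-weights")  # hypothetical fine-tuned adapter

prompt = "Colonoscopy image showing a single sessile polyp in the sigmoid colon"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("synthetic_gi_sample.png")
        </preformat>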
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Team MedPixel</title>
        <p>Team MedPixel [35] participated in both Subtask 1 and Subtask 2. For Subtask 1, they fine-tuned
the Florence2-0.3B model on the Kvasir-VQA dataset using LoRA and optimized hyperparameters
through Bayesian search with Optuna. Training was conducted on an RTX A4000 GPU, using gradient
accumulation to simulate larger batch sizes. Their best model (batch size 64, LoRA rank 16) achieved
strong performance, with a METEOR score of 0.48 and ROUGE-L of 0.86 on the private test set. For
Subtask 2, they fine-tuned Stable Diffusion v2.1 using LoRA to synthesize GI endoscopy images from
structured prompts.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Team IReL, IIT (BHU)</title>
        <p>Team IReL, IIT (BHU) [36] participated in both Subtask 1 and Subtask 2. For Subtask 1, they fine-tuned
the Florence2 model on Kvasir-VQA, using image preprocessing to remove specular highlights and
black borders. Training was performed on a single H100 GPU using AdamW and fp16 precision. For
Subtask 2, they fine-tuned Stable Diffusion v2.1 using LoRA on four L40 GPUs with synthetic prompts
and images at 768 × 768 resolution. They selected v2.1 based on its better trade-off between quality
and compute.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Team UPS</title>
        <p>Team UPS [37] participated in Subtask 1 and explored two approaches: a multimodal Chain-of-Thought
(CoT) reasoning method and fine-tuning of generative models. The CoT method used Qwen2-VL to
generate rationales before predicting answers through two-stage prompting. In contrast, the
fine-tuning strategy involved training BLIP2-Flan-T5-XL and Qwen2-VL using cross-entropy loss with LoRA.
Models were trained for 10 epochs on a single A100 GPU. The fine-tuned BLIP2 model performed best
among all their configurations.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>In Subtask 1, most teams adopted transformer-based multimodal architectures, with Florence2 being
most common. Fine-tuning was typically performed using LoRA, together with hyperparameter
optimization and input augmentation. In addition, methods incorporating structured prompting or
chain-of-thought reasoning showed potential, particularly for complex question categories that required
spatial reasoning or numerical inference, such as location-based and counting tasks. In Subtask 2, three
teams submitted models based on fine-tuned variants of Stable Diffusion. Similar to Subtask 1, LoRA
was the primary fine-tuning strategy. While all models demonstrated high visual fidelity in image
generation, the degree of alignment between prompts and outputs varied. Quantitative evaluation using
FBD and prompt-image consistency metrics indicated that current methods are still limited in their
ability to generate clinically accurate content. In several cases, generated images appeared realistic but
failed to reproduce specific anatomical or pathological features described in the input prompts.</p>
      <p>Although we received more submissions this year than last, the number of final submissions is still
low compared to the number of registered. This suggests that both subtasks still pose technical and
resource challenges. Subtask 2 in particular may benefit from additional baseline models, simplified
starter code, and clearer guidelines on expected output structure. The evaluation setup, while automated
and reproducible, may need to be complemented with more qualitative human review.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This paper presented the 2025 edition of the MedVQA-GI challenge, held as part of ImageCLEF. The
challenge included two sub-tasks focused on medical VQA and the generation of synthetic
gastrointestinal images. For future editions, we aim to refine and expand the task by providing more comprehensive
resources to support participants in getting started.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4o for grammar and spelling checks,
paraphrasing and rewording, and improving the writing style. After using this tool, the authors
reviewed and edited the content as needed and take full responsibility for the publication’s content.
</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>
[17] N. M. Ghatwary, X. Ye, M. Zolgharni, Esophageal abnormality detection using densenet based
faster r-cnn with gabor features, IEEE Access 7 (2019) 84374–84385. doi:https://doi.org/10.
1109/ACCESS.2019.2925585.
[18] S. Shah, N. Park, N. E. H. Chehade, A. Chahine, M. Monachese, A. Tiritilli, Z. Moosvi, R. Ortizo,
J. Samarasena, Effect of computer-aided colonoscopy on adenoma miss rates and polyp detection:
a systematic review and meta-analysis, Journal of Gastroenterology and Hepatology 38 (2023)
162–176.
[19] S. Hicks, M. Riegler, P. Smedsrud, T. B. Haugen, K. R. Randel, K. Pogorelov, H. K. Stensland, D.-T. Dang-Nguyen, M. Lux, A. Petlund, T. de Lange, P. T. Schmidt, P. Halvorsen, Acm multimedia
biomedia 2019 grand challenge overview, in: Proceedings of the ACM International Conference
on Multimedia (ACM MM), 2019, pp. 2563–2567. doi:https://doi.org/10.1145/3343031.
3356058.
[20] K. Pogorelov, M. Riegler, P. Halvorsen, S. A. Hicks, K. R. Randel, D.-T. Dang-Nguyen, M. Lux,
O. Ostroukhova, T. De Lange, Medico multimedia task at mediaeval 2018, in: Proceeding of the
MediaEval Benchmarking Initiative for Multimedia Evaluation Workshop (MediaEval), 2018.
[21] M. Riegler, K. Pogorelov, P. Halvorsen, K. Randel, S. Eskeland, D.-T. Dang-Nguyen, M. Lux, C.
Griwodz, C. Spampinato, T. de Lange, Multimedia for medicine: the medico task at mediaeval 2017,
in: Proceeding of the MediaEval Benchmarking Initiative for Multimedia Evaluation Workshop
(MediaEval), 2017.
[22] J. Bernal, H. Aymeric, Miccai endoscopic vision challenge polyp detection and segmentation,
https://endovissub2017-giana.grand-challenge.org/home/, 2017. Accessed: 2017-12-11.
[23] S. Hicks, M. Riegler, P. Smedsrud, T. B. Haugen, K. R. Randel, K. Pogorelov, H. K. Stensland, D.-T. Dang-Nguyen, M. Lux, A. Petlund, T. de Lange, P. T. Schmidt, P. Halvorsen, Acm multimedia
biomedia 2019 grand challenge overview, in: Proceedings of the 27th ACM International
Conference on Multimedia, MM ’19, Association for Computing Machinery, New York, NY, USA, 2019, p.
2563–2567. URL: https://doi.org/10.1145/3343031.3356058. doi:10.1145/3343031.3356058.
[24] V. Thambawita, P. Salehi, S. A. Sheshkal, S. A. Hicks, H. L. Hammer, S. Parasa, T. d. Lange,
P. Halvorsen, M. A. Riegler, Singan-seg: Synthetic training data generation for medical image
segmentation, PLOS ONE 17 (2022) 1–24. URL: https://doi.org/10.1371/journal.pone.0267976.
doi:10.1371/journal.pone.0267976.
[25] D. Yoon, H.-J. Kong, B. S. Kim, W. S. Cho, J. C. Lee, M. Cho, M. H. Lim, S. Y. Yang, S. H. Lim,
J. Lee, J. H. Song, G. E. Chung, J. M. Choi, H. Y. Kang, J. H. Bae, S. Kim, Colonoscopic image
synthesis with generative adversarial network for enhanced detection of sessile serrated lesions
using convolutional neural network, Sci Rep 12 (2022) 261.
[26] H. Borgli, V. Thambawita, P. H. Smedsrud, S. Hicks, D. Jha, S. L. Eskeland, K. R. Randel, K. Pogorelov,
M. Lux, D. T. D. Nguyen, et al., Hyperkvasir, a comprehensive multi-class image and video dataset
for gastrointestinal endoscopy, Scientific data 7 (2020). doi: 10.1038/s41597-020-00622-y.
[27] S. Gautam, A. Storås, C. Midoglu, S. A. Hicks, V. Thambawita, P. Halvorsen, M. A. Riegler,
Kvasir-VQA: A text-image pair gi tract dataset, in: Proceedings of the First International Workshop
on Vision-Language Models for Biomedical Applications (VLM4Bio), ACM, 2024, p. 10 pages.
doi:10.1145/3689096.3689458.
[28] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation
with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation
Measures for Machine Translation and/or Summarization, Association for Computational
Linguistics, Ann Arbor, Michigan, 2005, pp. 65–72. URL: https://www.aclweb.org/anthology/W05-0909.
[29] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization
Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL:
https://aclanthology.org/W04-1013/.
[30] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine
translation, in: Proceedings of the 40th Annual Meeting on Association for Computational
Linguistics, ACL ’02, Association for Computational Linguistics, USA, 2002, p. 311–318. URL:
https://doi.org/10.3115/1073083.1073135. doi:10.3115/1073083.1073135.
[31] M. Chaichuk, S. Gautam, S. Hicks, E. Tutubalina, Prompt to Polyp: Medical Text-Conditioned
Image Synthesis with Diffusion Models, arXiv (2025). doi:10.48550/arXiv.2505.05573.
arXiv:2505.05573.
[32] S. Zhang, Y. Xu, N. Usuyama, H. Xu, J. Bagga, R. Tinn, S. Preston, R. Rao, M. Wei, N. Valluri, et al.,
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific
image-text pairs, arXiv (2023). doi:10.48550/arXiv.2303.00915. arXiv:2303.00915.
[33] S. Gaihre, A. Thapa, P. Pokhrel, L. Tiwari, Multimodal ai for gastrointestinal diagnostics: Tackling
vqa in imageclefmed-medvqa-gi 2025, in: CLEF2025 Working Notes, CEUR Workshop Proceedings,
CEUR-WS.org, Madrid, Spain, 2025.
[34] E. P. O. Oluwafemi, M. Hoque, E. F. Akor, R. N. Chowdhury, A. Umar, M. M. Rahman, Solving
medical data limitations through ai: Multi-modal vision-language learning for gastrointestinal
vqa and synthetic training data generation, in: CLEF2025 Working Notes, CEUR Workshop
Proceedings, CEUR-WS.org, Madrid, Spain, 2025.
[35] G. Parajuli, Querying gi endoscopy images: A vqa approach, in: CLEF2025 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Madrid, Spain, 2025.
[36] K. Tewari, S. Pal, Bridging vision and language in gi diagnosis: Florence2 for question
answering and stable difusion for image synthesis, in: CLEF2025 Working Notes, CEUR Workshop
Proceedings, CEUR-WS.org, Madrid, Spain, 2025.
[37] O. Adjali, Towards better gastrointestinal diagnosis: Evaluating vision-language models for gi vqa,
in: CLEF2025 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Madrid, Spain, 2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spadaccini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Iannone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Maselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jovani</surname>
          </string-name>
          , V. T. Chandrasekar, G. Antonelli,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Areia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dinis-Ribeiro</surname>
          </string-name>
          , et al.,
          <article-title>Performance of artificial intelligence in colonoscopy for adenoma and polyp detection: a systematic review and meta-analysis</article-title>
          ,
          <source>Gastrointestinal endoscopy 93</source>
          (
          <year>2021</year>
          )
          <fpage>77</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alammari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tavanapong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. C.</given-names>
            <surname>De Groen</surname>
          </string-name>
          ,
          <article-title>Classification of ulcerative colitis severity in colonoscopy videos using cnn</article-title>
          ,
          <source>in: Proceedings of the ACM International Conference on Information Management and Engineering (ACM ICIME)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>139</fpage>
          -
          <lpage>144</lpage>
          . doi:https://doi.org/10.1145/3149572.3149613.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>