<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Querying GI Endoscopy Images: A VQA Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gaurav Parajuli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johannes Kepler Universität Linz</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>VQA (Visual Question Answering) combines Natural Language Processing (NLP) with image understanding to answer questions about a given image. It has enormous potential for the development of medical diagnostic AI systems. Such a system can help clinicians diagnose gastro-intestinal (GI) diseases accurately and efficiently. Although many of the multimodal LLMs available today have excellent VQA capabilities in the general domain, they perform very poorly for VQA tasks in specialized domains such as medical imaging. This study is a submission for ImageCLEFmed-MEDVQA-GI 2025 subtask 1 that explores the adaptation of the Florence2 model to answer medical visual questions on GI endoscopy images. We also evaluate the model performance using standard metrics like ROUGE, BLEU and METEOR. The code used in the experiments is publicly available at: github.com/gauravparajuli/ImageCLEFmed-MEDVQA-GI-2025-Task1.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical VQA</kwd>
        <kwd>Florence-2</kwd>
        <kwd>LoRA</kwd>
        <kwd>ImageCLEFmed 2025</kwd>
        <kwd>Multimodal AI</kwd>
        <kwd>Supervised Fine-tuning</kwd>
        <kwd>Clinical Question Answering</kwd>
        <kwd>Gastrointestinal Imaging</kwd>
        <kwd>Kvasir-VQA</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Numerous efforts have been made to advance the field of MEDVQA. Since 2018, the annual ImageCLEF
MEDVQA benchmark has played a critical role in pushing the field forward. Previously, MEDVQA
methods combined CNNs and BERT with attention-based fusion for radiology images [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ]. In
addition, the datasets lacked focus on GI endoscopy. However, with the introduction of the KVASIR [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
dataset and the recently introduced KVASIR-VQA [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], this issue has been mitigated. Current datasets
provide question–answer pairs on GI tract images [<xref ref-type="bibr" rid="ref12">12</xref>]. With the emergence of transformer-based
models that have demonstrated strong performance across a wide range of deep learning tasks, this work focuses
on fine-tuning Florence2 on the KVASIR-VQA dataset to enable effective VQA on GI endoscopic
images.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Overview and Dataset</title>
      <p>We participated in subtask 1 (VQA) of the ImageCLEFmed 2025 [<xref ref-type="bibr" rid="ref13">13</xref>] challenge. The objective of this
challenge was to develop a model capable of accurately interpreting and answering questions
related to GI endoscopy images.</p>
      <p>For this task, we used the Kvasir-VQA dataset, which is a combination of
HyperKvasir [<xref ref-type="bibr" rid="ref14">14</xref>] and Kvasir-Instrument [<xref ref-type="bibr" rid="ref15">15</xref>]. It consists of 6500 images across five different categories.</p>
      <p>Each of these 6500 images in KVASIR-VQA is accompanied by various question–answer pairs. There
are a total of six question types in the dataset:
1. Yes/No. Example: Does this image contain any finding?
2. Single choice. Example: What type of polyp is taken?
3. Multiple choice. Example: Are there any anatomical landmarks in the images?
4. Choice (color). Example: What color is the abnormality?
5. Location. Example: Where in the image is the abnormality?
6. Numerical counting. Example: How many polyps are there in the image?</p>
      <p>Since about 58,800 question–answer pairs were available for the 6500 images, no attempt was made
to augment the dataset by paraphrasing question–answer pairs.</p>
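      <p>For illustration, the dataset can be loaded and inspected with the Hugging Face datasets library as in the minimal sketch below; the dataset identifier, split name and column names are assumptions and should be checked against the official dataset card.</p>
      <preformat>
# Minimal sketch: load Kvasir-VQA and inspect one question-answer pair.
# The dataset id "SimulaMet-HOST/Kvasir-VQA", the split name and the column
# names are assumptions; verify them against the official dataset card.
from datasets import load_dataset

dataset = load_dataset("SimulaMet-HOST/Kvasir-VQA", split="raw")

sample = dataset[0]
print(sample["question"])
print(sample["answer"])
print(sample["image"].size)   # PIL image decoded by the datasets library
      </preformat>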
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Architecture Overview</title>
        <p>In this study, we fine-tuned the Florence2 model from Microsoft on KVASIR-VQA for medical VQA.
Florence2 uses prompt-based multitask learning, which enables it to perform a diverse set of tasks such
as object detection, segmentation and image captioning without the need for task-specific heads [<xref ref-type="bibr" rid="ref16">16</xref>].
The Florence2 architecture consists of (a) a vision encoder (DaViT) that converts images into visual token
embeddings, (b) a text encoder that processes prompt-style questions, (c) a multimodal transformer
encoder–decoder that fuses image and text tokens and (d) a generative output head that follows the
autoregressive decoding scheme used by other LLMs.</p>
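        <p>As a minimal sketch of this prompt-based interface (assuming the base checkpoint name on the Hugging Face Hub and that the question text is passed directly as the prompt), inference looks roughly as follows.</p>
        <preformat>
# Minimal sketch: prompt-style VQA inference with Florence2 via Hugging Face Transformers.
# The checkpoint name and the use of the raw question as the prompt are assumptions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

checkpoint = "microsoft/Florence-2-base"
processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

image = Image.open("endoscopy_sample.jpg").convert("RGB")   # placeholder image path
prompt = "How many polyps are there in the image?"

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=32,
    )
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
        </preformat>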
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Training Details</title>
        <p>
          No augmentations were made to the training dataset provided by the organizer. The dataset was divided
(using a seed value) into training and validation sets in a 9:1 ratio. We used LoRA (Low-Rank Adaptation)
adapters for fine-tuning. Training was carried out for 5 epochs (early stopping patience set to 2 epochs)
with evaluation at the end of each epoch on an NVIDIA RTX A4000 16GB GPU. We used Weights and
Biases (wandb) to track training progress.
        </p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Hyperparameter Search</title>
          <p>
In order to determine the optimal hyperparameters, we used Optuna with Bayesian optimization to
perform a hyperparameter search. We performed 100 search trials (one epoch each) using only 2.5%
of the actual dataset. Across all trials, the seed was fixed for reproducibility. Hyperparameter search
ranges for the trials were: (1) learning rate: [1e-6, 1e-4], (2) batch size per device: {2, 4}, (3) gradient
accumulation steps: {1, 2, 4}, (4) weight decay: [0.0, 0.1], (5) LoRA rank: {4, 8, 16}, (6) LoRA alpha: {8,
16, 32} and (7) LoRA dropout: [0.0, 0.3]. After 100 search trials, the best configuration found was: (a)
learning rate: 9.59e-5, (b) batch size per device: 2, (c) gradient accumulation steps: 2, (d) weight decay:
0.071, (e) LoRA rank: 16, (f) LoRA alpha: 32, (g) LoRA dropout: 0.05478.
        </p>
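          <p>A condensed sketch of the corresponding Optuna objective is given below; the training routine is a placeholder, and the TPE sampler stands in for the Bayesian optimizer.</p>
          <preformat>
# Minimal sketch of the hyperparameter search with Optuna.
# train_and_evaluate() is a placeholder for one fine-tuning epoch on the 2.5% subset
# that returns the validation loss to be minimized.
import optuna

def objective(trial):
    config = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [2, 4]),
        "grad_accum_steps": trial.suggest_categorical("grad_accum_steps", [1, 2, 4]),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
        "lora_rank": trial.suggest_categorical("lora_rank", [4, 8, 16]),
        "lora_alpha": trial.suggest_categorical("lora_alpha", [8, 16, 32]),
        "lora_dropout": trial.suggest_float("lora_dropout", 0.0, 0.3),
    }
    return train_and_evaluate(config)  # placeholder training/evaluation routine

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=100)
print(study.best_params)
          </preformat>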
        <p>However, the initial training with the above hyperparameters on the entire training set led to gradient
explosion. Therefore, the learning rate obtained from the search was scaled down by a factor of four
(referred to as the base learning rate from now on). Finally, to reduce the noise in the training loss
curve, the effective batch size was increased to 64 via gradient accumulation and the base learning rate
was scaled as: scaled_lr = base_lr × √(new_batch_size / old_batch_size).</p>
          <p>The rest of the hyperparameters were kept as found by the hyperparameter search.</p>
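          <p>Putting the pieces together, the final fine-tuning configuration can be sketched roughly as follows; the LoRA target modules and the exact trainer setup are assumptions rather than the verbatim training script.</p>
          <preformat>
# Minimal sketch of the final LoRA fine-tuning configuration.
# Target module names and the trainer setup are assumptions, not the verbatim script.
import math
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base",
                                             trust_remote_code=True)
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05478,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection names
))

searched_lr = 9.59e-5
base_lr = searched_lr / 4                       # scaled down after the gradient explosion
old_bs, new_bs = 4, 64                          # effective batch: search (2 x 2) vs. final
scaled_lr = base_lr * math.sqrt(new_bs / old_bs)

training_args = TrainingArguments(
    output_dir="florence2-kvasir-vqa",
    num_train_epochs=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=32,             # 2 x 32 = effective batch size of 64
    learning_rate=scaled_lr,
    weight_decay=0.071,
    eval_strategy="epoch",
    report_to="wandb",
)
          </preformat>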
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Evaluation</title>
      <p>The following table outlines the performance of our best model on both the public and private test sets
provided by the organizing team.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Ablation Studies</title>
      <p>To understand the impact of different fine-tuning strategies and the role of different components, we
performed the following ablation studies using four variants of the Florence2 model.
Please note that each model variant is named in the format baseModelName_batchSize_trainingStrategy.
Model weights for the above variants are available at: https://www.hf.co/gauravparajuli/model_variant_name
For the first variant, we froze the vision tower in the Florence2 architecture and proceeded with
training. For the second variant, we froze the encoder portion of the Florence2 language model in
addition to the vision tower. For the third and fourth variants, we used LoRA adapters with rank=8,
alpha=16 and rank=16, alpha=32, respectively.</p>
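      <p>Under assumptions about the attribute names in the Florence2 implementation, the frozen-component variants can be expressed roughly as follows.</p>
      <preformat>
# Minimal sketch of the frozen-component ablation variants.
# The attribute names (vision_tower, language_model.model.encoder) are assumptions
# about the Florence2 implementation.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-base",
                                             trust_remote_code=True)

def freeze(module):
    for param in module.parameters():
        param.requires_grad = False

# Variant 1: freeze only the vision tower (DaViT).
freeze(model.vision_tower)

# Variant 2: additionally freeze the language-model encoder, so only the decoder is trained.
freeze(model.language_model.model.encoder)

# Variants 3 and 4 instead wrap the full model with LoRA adapters
# (rank=8/alpha=16 and rank=16/alpha=32), as described in Section 4.2.
      </preformat>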
      <p>Here we can clearly see that the LoRA-based variant with rank=16 performs best overall. The
LoRA-based variant with rank=8 achieves the highest BLEU score, possibly due to a regularization
effect from the reduced parameter count, but it slightly underperforms on all other metrics.</p>
      <p>The second variant, in which both the encoder and the vision tower were frozen, performs slightly
worse than the first variant, in which only the vision tower was frozen. This suggests that while fine-tuning
the decoder alone can capture some useful adaptation, the encoder’s contribution is crucial for optimal
performance in the multimodal task.</p>
      <p>In general, these results validate the effectiveness of LoRA-based fine-tuning on downstream tasks.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion and Comparison with Literature</title>
      <p>Our approach surpasses the previous year’s baseline, which had a ROUGE1 score
of 0.6955. However, despite achieving a strong ROUGE score, our model performed worse than the
previous year’s baseline in terms of the BLEU score (0.21 vs. 0.3757). The previous baseline was trained
on only 2000 images (a small subset). As the BLEU metric penalizes short candidates, it is plausible that
our model, which possibly generated more diverse and concise answers due to greater data exposure,
was disproportionately penalized.</p>
      <p>In addition, Kvasir-VQA contains several question types. For "Yes/No" questions and single-word-answer
questions, a higher ROUGE score is easier to achieve because the vocabulary is limited. However, the BLEU
score in these cases will be very low or zero if the prediction does not exactly match the single-word
ground truth. This is why the BLEU score was zero for the majority of question types in
Table 3.</p>
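      <p>To make this concrete, the small sketch below scores a single-word answer with the Hugging Face evaluate library; with unsmoothed 4-gram BLEU, a one-word candidate yields a BLEU of zero even when it matches the reference, while ROUGE-1 is 1.0 (the official scorer's settings may differ).</p>
      <preformat>
# Illustration (assumed scorer settings): single-word answers under BLEU vs. ROUGE.
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

pred, ref = ["polyp"], ["polyp"]
print(bleu.compute(predictions=pred, references=ref)["bleu"])     # 0.0: no higher-order n-grams
print(rouge.compute(predictions=pred, references=ref)["rouge1"])  # 1.0: unigram overlap
      </preformat>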
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion and Future Work</title>
      <p>It is evident from this study that it is possible to train a robust MEDVQA system based on Kvasir-VQA.
Notable future directions for this study include:
1. Expanding the current dataset by introducing more diverse images and question–answer pairs, which
might help in developing more robust and generalizable models.
2. Improving the performance of the model on the BLEU metric without sacrificing performance on
the ROUGE metric. This could involve exploring different decoding strategies and loss functions that
encourage more grammatically correct sentences.
3. Extensively evaluating the model in real-world clinical settings to gauge its potential for real-world
applications.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>We would like to thank the organizers from the SimulaMet Department of Holistic Learning who made
this event possible. This work heavily relied on the KVASIR-VQA dataset, which was also compiled
by SimulaMet, and we deeply appreciate their contribution. Additionally, we would like to thank the
researchers from Microsoft for their Florence-2 model. Lastly, this work would not have been possible
without Hugging Face. We heartily thank everyone who has contributed to the Transformers library
and the PEFT library within the Hugging Face ecosystem.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for grammar and spelling checks.
After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s)
full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Gelberg</surname>
          </string-name>
          ,
          <article-title>Pathophysiological mechanisms of gastrointestinal toxicity</article-title>
          , Comprehensive
          <string-name>
            <surname>Toxicology</surname>
          </string-name>
          (
          <year>2017</year>
          )
          <fpage>139</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. B.</given-names>
            <surname>Navarre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pugh</surname>
          </string-name>
          ,
          <article-title>Diseases of the gastrointestinal system</article-title>
          , Sheep &amp; Goat
          <string-name>
            <surname>Medicine</surname>
          </string-name>
          (
          <year>2009</year>
          )
          <fpage>69</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Arnold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Abnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Neale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vignat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Giovannucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. A.</given-names>
            <surname>McGlynn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bray</surname>
          </string-name>
          ,
          <article-title>Global burden of 5 major types of gastrointestinal cancer</article-title>
          ,
          <source>Gastroenterology</source>
          <volume>159</volume>
          (
          <year>2020</year>
          )
          <fpage>335</fpage>
          -
          <lpage>349</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O. F.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Soares</surname>
          </string-name>
          , E. Mazomenos,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brandao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Seward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. B.</given-names>
            <surname>Lovat</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence and computer-aided diagnosis in colonoscopy: current evidence and future directions</article-title>
          ,
          <source>The lancet Gastroenterology &amp; hepatology 4</source>
          (
          <year>2019</year>
          )
          <fpage>71</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Berzin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R. G.</given-names>
            <surname>Brown</surname>
          </string-name>
          , S. Bharadwaj,
          <string-name>
            <given-names>A.</given-names>
            <surname>Becq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          , P. Liu,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Real-time automatic detection system increases colonoscopic polyp and adenoma detection rates: a prospective randomised controlled study</article-title>
          ,
          <source>Gut</source>
          <volume>68</volume>
          (
          <year>2019</year>
          )
          <fpage>1813</fpage>
          -
          <lpage>1819</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chaichuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <article-title>Prompt to Polyp: Medical Text-Conditioned Image Synthesis with Diffusion Models</article-title>
          , arXiv (
          <year>2025</year>
          ). doi:10.48550/arXiv.2505.05573. arXiv:2505.05573.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          , L. Gu, Zhejiang university at imageclef 2019 visual
          <article-title>question answering in the medical domain</article-title>
          ,
          <source>in: Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2019</year>
          . URL: https://www.imageclef.org/2019/medical/vqa, team Hanlin.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sarrouti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>2936</volume>
          ,
          <year>2021</year>
          . URL: http://ceur-ws.org/Vol-2936.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Datla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>Vqa-med: Overview of the medical visual question answering task at imageclef 2019</article-title>
          ,
          <source>in: Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2019</year>
          . URL: https://github.com/abachaa/VQA-Med-2019.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Pogorelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Randel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Griwodz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Eskeland</surname>
          </string-name>
          , T. de Lange,
          <string-name>
            <given-names>D.</given-names>
            <surname>Johansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Spampinato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-T.</given-names>
            <surname>Dang-Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. T.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          , et al.,
          <article-title>Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection</article-title>
          ,
          <source>in: Proceedings of the 8th ACM on Multimedia Systems Conference</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>164</fpage>
          -
          <lpage>169</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gautam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Storås</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Midoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <article-title>Kvasir-VQA: A Text-Image Pair GI Tract Dataset</article-title>
          , in: ACM Conferences, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>12</lpage>
          . doi:10.1145/3689096.3689458.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] S. Gautam, M. A. Riegler, P. Halvorsen, Kvasir-VQA-x1: A Multimodal Dataset for Medical Reasoning and Robust MedVQA in Gastrointestinal Endoscopy, arXiv (2025). doi:10.48550/arXiv.2506.09958. arXiv:2506.09958.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] B. Ionescu, H. Müller, D.-C. Stanciu, A. Idrissi-Yaghir, A. Radzhabov, A. G. S. de Herrera, A. Andrei, A. Storås, A. B. Abacha, B. Bracke, B. Lecouteux, B. Stein, C. Macaire, C. M. Friedrich, C. S. Schmidt, D. Fabre, D. Schwab, D. Dimitrov, E. Esperança-Rodier, G. Constantin, H. Becker, H. Damm, H. Schäfer, I. Rodkin, I. Koychev, J. Kiesel, J. Rückert, J. Malvehy, L.-D. Ştefan, L. Bloch, M. Potthast, M. Heinrich, M. A. Riegler, M. Dogariu, N. Codella, P. H. P. Nakov, R. Brüngel, R. A. Novoa, R. J. Das, S. A. Hicks, S. Gautam, T. M. G. Pakull, V. Thambawita, V. Kovalev, W.-W. Yim, Z. Xie, Overview of ImageCLEF 2025: Multimedia retrieval in medical, social media and content recommendation applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025), Springer Lecture Notes in Computer Science (LNCS), Madrid, Spain, 2025.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] H. Borgli, V. Thambawita, P. H. Smedsrud, S. Hicks, D. Jha, S. L. Eskeland, K. R. Randel, K. Pogorelov, M. Lux, D. T. D. Nguyen, et al., HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy, Sci. Data 7 (2020) 1-14. doi:10.1038/s41597-020-00622-y.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] D. Jha, S. Ali, K. Emanuelsen, S. A. Hicks, V. Thambawita, E. Garcia-Ceja, M. A. Riegler, T. De Lange, P. T. Schmidt, H. D. Johansen, et al., Kvasir-Instrument: Diagnostic and therapeutic tool segmentation dataset in gastrointestinal endoscopy, in: MultiMedia Modeling: 27th International Conference, MMM 2021, Prague, Czech Republic, June 22-24, 2021, Proceedings, Part II 27, Springer, 2021, pp. 218-229.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] B. Xiao, H. Wu, W. Xu, X. Dai, H. Hu, Y. Lu, M. Zeng, C. Liu, L. Yuan, Florence-2: Advancing a unified representation for a variety of vision tasks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 4818-4829.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>