<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SSN MLRG at MEDVQA-GI 2023: Visual Question Generation and Answering using Transformer based Pre-trained Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sheerin Sitara Noor Mohamed</string-name>
          <email>sheerinsitaran@ssn.edu.in</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kavitha Srinivasan</string-name>
          <email>kavithas@ssn.edu.in</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raghuraman Gopalsamy</string-name>
          <email>raghuramang@ssn.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of CSE, Sri Sivasubramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Kalavakkam - 603110</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Technological development in the current era demands Artificial Intelligence (AI) in all fields, and the medical field is no exception, with real-time applications such as medical report summarization, image captioning, Visual Question Answering (VQA) and Visual Question Generation (VQG). ImageCLEF is one of the forums that constantly conducts challenges in these applications. In this paper, for the given MEDVQA-GI dataset, three medical VQA models and one medical VQG model are proposed. The medical VQA models are developed using VisionTransformer (ViT), SegFormer and VisualBERT techniques over the eighteen category-based QA-pairs per image, and achieve accuracies of 95.6%, 95.7% and 62.4% respectively. The proposed medical VQG model is developed using the Category based Medical Visual Question Generation (CMVQG) technique.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Medical Visual Question Answering and Generation is a challenging field in Natural
Language Processing (NLP) and Computer Vision because of the complex nature of both image and
text. ImageCLEF [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], an online evaluation forum, analyses current trends and has been conducting
research-related tasks since 2018. In 2018 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and 2019 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the tasks concentrated on visual questions related
to different organs, planes, modalities and abnormalities. Then in 2020 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and 2021 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], ImageCLEF
concentrated on abnormality-type questions alone, based on the inferences made from the previous two
years. This year [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the task addresses colonoscopy-based Visual Question Answering
(VQA) and Visual Question Generation (VQG).
      </p>
      <p>
        The VQA and VQG datasets given by ImageCLEF are based on the HyperKvasir dataset and the
Kvasir-Instrument dataset. These datasets are used to develop the proposed models using suitable
techniques, and the evaluation metrics are discussed in the following paragraphs. The VQA approaches
are: joint embedding [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], hybrid, compositional and transformer-based techniques. Among these, transformer-based
techniques are chosen because they can capture the relationships between sequential elements and
perform parallel processing more quickly. The transformer-based techniques used in this paper are
VisionTransformer (ViT) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], SegFormer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], VisualBERT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
and Category based Medical Visual Question Generation (CMVQG, for VQG). The reasons for
choosing these techniques are: (i) ViT incorporates more global information than other pre-trained
models at lower layers, leading to quantitatively different features; (ii) VisualBERT is a simple and
flexible framework for modelling a broad range of vision and language tasks; (iii) SegFormer has
advantages in speed, accuracy and number of parameters; (iv) CMVQG generates questions based on
the category rather than the answer in the VQA dataset.
      </p>
      <p>The rest of the paper is organized as follows. Section 2 describes the MEDVQA-GI 2023 task
and dataset. Section 3 presents the system design of the proposed VQA and VQG models.
Section 4 analyses the results using suitable quantitative metrics, and Section 5 concludes with future
work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task and Dataset Description</title>
      <p>In this section, the two sub-tasks of MEDVQA-GI 2023 and the given dataset are explained. The two
sub-tasks are Visual Question Answering (VQA) and Visual Question Generation (VQG).</p>
    </sec>
    <sec id="sec-3">
      <title>2.1. ImageCLEF MEDVQA-GI 2023 task</title>
      <p>ImageCLEF, a part of the Conference and Labs of the Evaluation Forum, has been conducting tasks
related to the medical domain since 2018. Its goal is that, through the combination of text and image
data, the output of the analysis becomes easier for medical experts to use.</p>
      <p>In sub-task A (VQA), the answer for a colonoscopy image needs to be generated with respect to
the given question. For example, given an image containing a colon polyp and the question, “What type
of polyp is present?”, the answer should be a textual description of the type of polyp located in the
image.</p>
      <p>In sub-task B (VQG), the question for a colonoscopy image needs to be generated based on the
significant information present in the image and answer. The significant information includes the
location, count, color, size and shape of the polyps, the modality, the abnormality type, etc.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2. ImageCLEF MEDVQA-GI 2023 dataset</title>
      <p>The MEDVQA-GI 2023 task consists of two sub-tasks, namely VQA and VQG. The image, mask
and QA-pair counts for both tasks are given in Table 1. Each sub-task consists of a training set and a
test set. In these tasks, 18 QA-pairs are associated with each image, so the count of QA-pairs is eighteen
times the number of images. These questions are tabulated along with the frequent answers for each
question and their categories in Table 2. Based on these categories, the VQA and VQG models are
generated, as discussed in Section 3.</p>
      <p>[Tables 1 and 2 are not fully reproduced here; the surviving fragments indicate a test set of 1949 and a location-oriented QA-pair category.]</p>
    </sec>
    <sec id="sec-5">
      <title>3. System Design</title>
      <p>The system design of the proposed medical VQA and VQG models is shown in Figures 1, 2, 3
and 4. For medical VQA, three models are developed using VisionTransformer (ViT), SegFormer and
VisualBERT based on the QA-pair categories, and one medical VQG model is created using the
Category based Medical Visual Question Generation (CMVQG) technique, as given in Table 3.</p>
      <p>The three VQA models are developed using the colonoscopy images and QA-pairs in the training
set and are validated by predicting the labels for the test set. The Vision Transformer (ViT) VQA model
is developed for QA-pairs of classification or numeric type and is shown in Figure 1. The ViT
divides the input image into patches of 16×16 pixels and linearly projects the flattened patches. The
QA-pairs are converted into tokens using token and position embeddings. Based on the patches and
tokens, the model is trained autoregressively to predict the next token under causal (or
unidirectional) self-attention, followed by a Multi Layer Perceptron (MLP). The model was implemented
using the Vision Encoder Decoder class from the Hugging Face Transformers library and the tiny
Data-efficient image Transformer (DeiT) pre-trained on the ImageNet dataset.</p>
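      <p>As a rough illustration of this setup, the following sketch pairs a DeiT-tiny encoder with an autoregressive text decoder through the Vision Encoder Decoder class. The checkpoint names (facebook/deit-tiny-patch16-224, gpt2 as a stand-in causal decoder) and the image path are illustrative assumptions, not the exact training configuration used here.</p>
      <preformat>
# Hedged sketch: a VisionEncoderDecoder VQA-style generator built from a DeiT-tiny
# patch encoder and an autoregressive (causal) text decoder. The checkpoints and
# the image path are illustrative assumptions, not the exact configuration used here.
from PIL import Image
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

encoder_ckpt = "facebook/deit-tiny-patch16-224"   # DeiT-tiny, 16x16 pixel patches
decoder_ckpt = "gpt2"                             # stand-in causal text decoder

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(encoder_ckpt, decoder_ckpt)
image_processor = AutoImageProcessor.from_pretrained(encoder_ckpt)
tokenizer = AutoTokenizer.from_pretrained(decoder_ckpt)
tokenizer.pad_token = tokenizer.eos_token

# Tie the generation-time special tokens to the decoder vocabulary.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

image = Image.open("colonoscopy_example.jpg").convert("RGB")   # placeholder path
pixel_values = image_processor(image, return_tensors="pt").pixel_values

# During training, the tokenised question-answer text would be supplied as labels so
# the decoder learns next-token prediction; here only greedy generation is shown.
generated_ids = model.generate(pixel_values, max_length=32)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
      </preformat>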
      <p>VisualBERT is an extension of Bidirectional Encoder Representations from Transformers
(BERT) that models an image with respect to its bounding regions. In VisualBERT, the tokens
and vocabulary lists are generated using position and segment embeddings, and these are concatenated
with image features to train the model. A Faster Region based Convolutional Neural
Network (RCNN) is used to extract features from the image and to represent each segmented
region with a bounding box. From the segmented regions, the appearance features are extracted and
then embedded with the text features to generate the model, as shown in Figure 2.</p>
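      <p>A minimal sketch of this pipeline is given below, assuming the region features have already been extracted by a Faster RCNN; the uclanlp/visualbert-vqa-coco-pre checkpoint and the 36-region, 2048-dimensional features are illustrative stand-ins rather than the exact setup.</p>
      <preformat>
# Hedged sketch: question tokens plus Faster RCNN region features fed into VisualBERT.
# The checkpoint name and the 36 x 2048 region features are stand-ins; in practice the
# features come from the detector's pooled region (bounding-box) embeddings.
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

question = "What type of polyp is present?"
inputs = tokenizer(question, return_tensors="pt")

num_regions, feat_dim = 36, 2048                          # stand-in for RCNN output
visual_embeds = torch.randn(1, num_regions, feat_dim)
visual_token_type_ids = torch.ones(1, num_regions, dtype=torch.long)
visual_attention_mask = torch.ones(1, num_regions)

outputs = model(
    **inputs,
    visual_embeds=visual_embeds,
    visual_token_type_ids=visual_token_type_ids,
    visual_attention_mask=visual_attention_mask,
)
# The pooled output can feed a classification head over the answer vocabulary.
print(outputs.pooler_output.shape)
      </preformat>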
      <p>In Figure 3, the image is divided into small patches, which favours the dense prediction task. These
patches are given as input to the attention layers to obtain multi-level features at fractions of the original
image resolution. They are then passed to a Multi-Layer Perceptron to implicitly discover useful
alignments between both sets of inputs in terms of color and to build up a new joint representation in the
training phase. Finally, the generated model is validated by predicting the answers to the color-oriented
questions for the given colonoscopy image.</p>
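      <p>One way to realise this stage is sketched below, assuming the nvidia/mit-b0 SegFormer backbone supplies the image features and a small MLP fuses them with a bag-of-words question embedding to score colour answers; the fusion scheme and all dimensions are our own simplifications rather than the exact architecture.</p>
      <preformat>
# Hedged sketch: SegFormer (MiT-b0) as a hierarchical patch encoder whose pooled
# features are fused with a bag-of-words question embedding by an MLP that scores
# colour answers. Checkpoint, dimensions and the fusion scheme are assumptions.
import torch
import torch.nn as nn
from transformers import SegformerModel

class ColorVQAHead(nn.Module):
    def __init__(self, question_vocab=1000, num_answers=16, hidden=256):
        super().__init__()
        self.backbone = SegformerModel.from_pretrained("nvidia/mit-b0")
        img_dim = self.backbone.config.hidden_sizes[-1]    # last-stage channel width
        self.question_embed = nn.EmbeddingBag(question_vocab, hidden)
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers))

    def forward(self, pixel_values, question_ids):
        feats = self.backbone(pixel_values).last_hidden_state   # (B, C, H', W')
        img_vec = feats.mean(dim=(2, 3))                        # global average pool
        q_vec = self.question_embed(question_ids)               # pooled question code
        return self.mlp(torch.cat([img_vec, q_vec], dim=-1))    # answer logits

model = ColorVQAHead()
pixel_values = torch.randn(1, 3, 512, 512)      # stand-in for a preprocessed image
question_ids = torch.randint(0, 1000, (1, 8))   # stand-in question token ids
print(model(pixel_values, question_ids).shape)  # torch.Size([1, 16])
      </preformat>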
      <p>For medical VQG model generation, the Category based Medical Visual Question Generation
(CMVQG) approach is used, as shown in Figure 4. In the proposed CMVQG approach, a Convolutional
Neural Network (CNN), a Multi-Layer Perceptron (MLP) and a Long Short Term Memory (LSTM)
network are used in the training phase. The CNN is capable of extracting the image features and
learning the internal representation of an image. The MLP is used to extract the text features from the
given answers, questions and categories with respect to an image. The LSTM handles long-term
dependencies over an extended period of time. Following this, two encoders are used to generate latent
encodings from the features of both the image and the text. Both latent encodings are then concatenated
and passed to a weighted MLP, which generates the corresponding latent representation. This
concatenated latent representation acts as a backbone that contains the significant information for
question generation. The final model is generated by passing this concatenated latent representation to
the LSTM, which generates the question as a sequence of words based on the previous words.</p>
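      <p>The sketch below captures the CMVQG pipeline in simplified form, with a small CNN image encoder, an MLP text encoder over pooled embeddings of the answer and category tokens, a weighted MLP that fuses the two latent codes, and an LSTM that decodes the question word by word; all layer sizes and vocabularies are illustrative assumptions.</p>
      <preformat>
# Hedged sketch of the CMVQG idea: a CNN encodes the image, an MLP encodes the
# answer/category text, a weighted MLP fuses both latent codes, and an LSTM decodes
# the question autoregressively. Dimensions and vocabulary sizes are illustrative.
import torch
import torch.nn as nn

class CMVQG(nn.Module):
    def __init__(self, vocab_size=5000, embed=128, latent=256):
        super().__init__()
        self.cnn = nn.Sequential(                     # image feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, latent))
        self.text_embed = nn.EmbeddingBag(vocab_size, embed)   # answer + category
        self.text_mlp = nn.Sequential(nn.Linear(embed, latent), nn.ReLU())
        self.fuse = nn.Linear(2 * latent, latent)     # weighted MLP over both codes
        self.word_embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, latent, batch_first=True)
        self.out = nn.Linear(latent, vocab_size)

    def forward(self, image, answer_ids, question_ids):
        img_code = self.cnn(image)
        txt_code = self.text_mlp(self.text_embed(answer_ids))
        joint = torch.tanh(self.fuse(torch.cat([img_code, txt_code], dim=-1)))
        # The joint latent code initialises the LSTM state; the question is then
        # predicted one word at a time from the previous words (teacher forcing).
        h0 = joint.unsqueeze(0)
        c0 = torch.zeros_like(h0)
        dec_in = self.word_embed(question_ids[:, :-1])
        dec_out, _ = self.lstm(dec_in, (h0, c0))
        return self.out(dec_out)                      # logits for the next words

model = CMVQG()
image = torch.randn(2, 3, 224, 224)
answer_ids = torch.randint(0, 5000, (2, 6))
question_ids = torch.randint(0, 5000, (2, 12))
print(model(image, answer_ids, question_ids).shape)   # torch.Size([2, 11, 5000])
      </preformat>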
    </sec>
    <sec id="sec-6">
      <title>4. Experiments and Results</title>
      <p>The hardware and software required for the implementation of the VQA and VQG models include:
(i) an Intel i5 processor with an NVIDIA GeForce Ti 4800 Graphical Processing Unit, 4.3 GHz clock
speed, 16 GB RAM and 2 TB disk space; (ii) the Linux Ubuntu 20.04 operating system and Python 3.7
with required libraries such as tensorflow, torch, sklearn, nltk, pickle, pandas, etc.</p>
      <p>The three VQA models and one VQG model are created and validated using the MEDVQA-GI 2023
dataset. The VQA models are initially trained for 20 epochs and then trained for an additional 20 epochs
starting from the checkpoint with the lowest validation loss, with a learning rate of 5×10−5. Due to the
limitations of the computational resources available, we were unable to fine-tune the models using
self-critical sequence training. The performance of the VQA models is then analysed for each question,
as given in Tables 4 and 5. From Table 4, it can be inferred that the overall accuracy is 0.471. In addition,
the classification-type QA-pairs attained the highest accuracy of 0.956. The highest, lowest and overall
accuracy for each category is given in Table 5 for better understanding.</p>
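      <p>A compressed sketch of this two-stage schedule is shown below; only the 20 + 20 epochs, the checkpoint selection by lowest validation loss and the 5×10−5 learning rate come from the text, while the optimiser choice, the Hugging Face style loss interface and the data loaders are assumptions.</p>
      <preformat>
# Hedged sketch of the two-stage schedule described above: train for 20 epochs, keep
# the checkpoint with the lowest validation loss, then resume from it for 20 more
# epochs. `model`, `train_loader` and `val_loader` are placeholders, and the model is
# assumed to return a Hugging Face style output with a `.loss` attribute.
import torch

def run_stage(model, train_loader, val_loader, epochs=20, lr=5e-5,
              ckpt_path="best_model.pt"):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # optimiser is assumed
    best_val = float("inf")
    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        model.eval()
        with torch.no_grad():
            val_loss = sum(model(**b).loss.item() for b in val_loader) / len(val_loader)
        if best_val > val_loss:                   # keep the best checkpoint so far
            best_val = val_loss
            torch.save(model.state_dict(), ckpt_path)
    return ckpt_path

# Stage 1, then resume from the lowest-validation-loss checkpoint for stage 2.
# ckpt = run_stage(model, train_loader, val_loader)
# model.load_state_dict(torch.load(ckpt))
# run_stage(model, train_loader, val_loader)
      </preformat>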
      <p>From Table 4, it can be inferred that VisualBERT and ViT maintain reasonable accuracy for all types
of QA-pairs, and hence their accuracy ranges from 40% to 60%. In contrast, SegFormer and ViT (for
classification-type QA-pairs) are question specific and hence attain the highest accuracy of 95% as well
as the lowest accuracy of 1%.</p>
    </sec>
    <sec id="sec-7">
      <title>5. Conclusion</title>
      <p>This research experimented with a category based approach to solve the VQA and VQG tasks of
ImageCLEF MEDVQA-GI 2023. For this task, three VQA models (ViT, SegFormer and VisualBERT)
and one VQG model (CMVQG) are developed and validated. From the results of the proposed models,
it is inferred that SegFormer and ViT are more problem specific, and hence their overall performance is
52.2% and 53.0% respectively, which could be improved by choosing the appropriate QA-pair categories
with respect to the medical image. On the other hand, VisualBERT is task generic, so it performs
reasonably well on most VQA datasets, with performance ranging from 48.1% to 62.4%. In future work,
the performance can be improved by creating medical domain specific transformer based models. The
overall performance can also be improved by concentrating on the abnormality-type questions.</p>
    </sec>
    <sec id="sec-8">
      <title>6. Acknowledgements</title>
      <p>Our profound gratitude to the Department of CSE, Sri Sivasubramaniya Nadar College of Engineering,
for allowing us to utilize the High Performance Computing Laboratory and GPU Server for the
successful execution of this challenge.</p>
    </sec>
    <sec id="sec-9">
      <title>7. References</title>
      <p>[11] https://keras.io/examples/vision/image_classification_with_vision_transformer/
[12] https://huggingface.co/blog/mask2former
[13] https://datasets.simula.no/hyper-kvasir/
[14] https://datasets.simula.no/kvasir-instrument/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farri</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lungren</surname>
            ,
            <given-names>M. P.</given-names>
          </string-name>
          (
          <year>2018</year>
          , September).
          <article-title>Overview of ImageCLEF 2018 Medical Domain Visual Question Answering Task</article-title>
          .
          <source>In CLEF (Working Notes).</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Chamberlain</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campello</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wright</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clift</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Seco de Herrera</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          (
          <year>2019</year>
          , July).
          <article-title>Overview of ImageCLEFcoral 2019 task</article-title>
          .
          <source>In CEUR Workshop Proceedings</source>
          (Vol.
          <volume>2380</volume>
          ).
          <source>CEUR Workshop Proceedings.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Ben Abacha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarrouti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S. A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain</article-title>
          .
          <source>In Proceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum-working notes. 21-24 September</source>
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Ben Abacha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sarrouti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S. A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain</article-title>
          .
          <source>In Proceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum-working notes. 21-24 September</source>
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Bogdan Ionescu, Henning Müller, Ana-Maria Drăgulinescu, Wen-wai Yim, Asma Ben Abacha, Neal Snider, Griffin Adams, Meliha Yetisgen, Johannes Rückert, Alba García Seco de Herrera, Christoph M. Friedrich, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Steven A. Hicks, Michael A. Riegler, Vajira Thambawita, Andrea Storås, Pål Halvorsen, Nikolaos Papachrysos, Johanna Schöler, Debesh Jha, Alexandra-Georgiana Andrei, Ahmedkhan Radzhabov, Ioan Coman, Vassili Kovalev, Alexandru Stan, George Ioannidis, Hugo Manguinhas, Liviu-Daniel Ștefan, Mihai Gabriel Constantin, Mihai Dogariu, Jerome Deshayes, Adrian Popescu.
          <article-title>Overview of ImageCLEF 2023: Multimedia Retrieval in Medical, Social Media and Recommender Systems Applications</article-title>
          . In Experimental IR Meets Multilinguality, Multimodality, and Interaction,
          <source>Proceedings of the 14th International Conference of the CLEF Association (CLEF 2023)</source>
          , Springer Lecture Notes in Computer Science (LNCS), Thessaloniki, Greece,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          Steven A. Hicks, Andrea Storås, Pål Halvorsen, Thomas de Lange, Michael A. Riegler, Vajira Thambawita,
          <article-title>Overview of ImageCLEFmedical 2023 - Medical Visual Question Answering for Gastrointestinal Tract</article-title>
          ,
          <source>CLEF2023 Working Notes, CEUR Workshop Proceedings, September 18-21</source>
          ,
          <year>2023</year>
          , Thessaloniki, Greece.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Noor Mohamed</surname>
            ,
            <given-names>S. S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Srinivasan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          .
          <article-title>A comprehensive interpretation for medical VQA: Datasets, techniques, and challenges</article-title>
          .
          <source>Journal of Intelligent &amp; Fuzzy Systems, (Preprint)</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>A survey on vision transformer</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          ,
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <fpage>87</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anandkumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alvarez</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>SegFormer: Simple and efficient design for semantic segmentation with transformers</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <volume>34</volume>
          ,
          <fpage>12077</fpage>
          -
          <lpage>12090</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yatskar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsieh</surname>
            ,
            <given-names>C. J.,</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K. W.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Visualbert: A simple and performant baseline for vision and language</article-title>
          . arXiv preprint arXiv:1908.03557.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>