<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal Learning for Skin Lesion Segmentation and Closed Visual Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bhagyashree Mallanaikar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shradha Kekare</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Padmashree Desai</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sujata C</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Uma Mudenagudi</string-name>
          <email>uma@kletech.ac.in</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ramesh Ashok Tabib</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anjali Savalkar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aishwarya S.H</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Accurate interpretation of medical images is essential for intelligent diagnostic systems. This project presents a unified deep learning framework that tackles two key challenges in medical image analysis: skin lesion segmentation and closed-ended visual question answering (VQA). For lesion segmentation, we introduce a model based on the Multi-Scale Feature Fusion Network (MSFNet), enhanced with boundary and reverse attention modules. This design improves the detection of irregular and low-contrast lesions. Tested on 314 dermatology images, the model achieved a mean Dice coefficient of 0.7021, a mean Jaccard index of 0.5410, and a maximum Dice score of 0.7512, supporting its effectiveness in aiding early melanoma detection. In parallel, our closed-ended VQA system combines visual feature extraction with language embeddings to answer structured questions such as "yes/no," object types, and numeric values. On a set of 56 question-image pairs, it achieved 56.98% overall accuracy, with high scores in categories CQID012 (74.80%) and CQID035 (74.00%). Together, these results showcase the promise of deep learning in multi-modal medical image understanding. The integration of segmentation and VQA in a single pipeline highlights its potential for real-world applications, including clinical decision support, assistive tools, and automated medical interpretation.</p>
      </abstract>
      <kwd-group>
        <kwd>Skin Lesion Segmentation</kwd>
        <kwd>Visual Question Answering (VQA)</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>MSFNet</kwd>
        <kwd>Multi-modal Reasoning</kwd>
        <kwd>Medical AI</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Skin lesion segmentation is crucial for the early detection of melanoma and other dermatological
conditions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Traditional methods using hand-crafted features often fail due to low contrast and
irregular lesion boundaries. Deep learning models like U-Net [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], DeepLabV3+ [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and Attention
U-Net[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] have significantly improved performance but struggle with fine boundary refinement and
semantic ambiguity.
      </p>
      <p>
        To address these challenges, we adopt the Multi-Scale Feature Fusion Network (MSFNet), which offers
an effective balance between precision and computational efficiency through its integration of attention
modules and hierarchical feature fusion [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Our work evaluates this architecture on a challenging
real-world dataset as part of the MEDIQA-MAGIC 2025 segmentation task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>The remainder of this paper is organized as follows: Section 2 reviews related work on skin lesion
segmentation. Section 3 describes the proposed MSFNet architecture and methodology. Section 4 details
the training strategy and the evaluation metrics employed for performance assessment. Section 5 discusses
the experimental results. Finally, Section 6 provides the conclusion
and future scope.</p>
      <p>
        Closed Visual Question Answering (Closed-VQA) is a vision-language task where systems answer
image-based queries using a fixed set of responses [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], crucial for applications like medical diagnostics
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], autonomous systems [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and HCI [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Unlike open-ended VQA, it ensures interpretability and
control, vital for high-stakes domains [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Despite progress with CNNs and Transformers, challenges
remain in compositional reasoning and domain-specific contexts. To address this, we propose a model
with cross-modal attention and fine-grained alignment for reliable, interpretable predictions in clinical
and industrial applications.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background Study</title>
      <p>
        U-Net [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] established encoder-decoder networks with skip connections for biomedical segmentation.
Subsequent works like DeepLabV3+ [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] introduced atrous convolution for multi-scale context, while
Attention U-Net[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] incorporated attention gates to improve focus on lesion areas. MSFNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] combines
parallel partial decoders (PPD), boundary attention (BA), and reverse attention (RA) modules to enhance
edge sensitivity and semantic integration.
      </p>
      <p>
        Other alternatives like Vision Transformers[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], MedT [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and GAN-based methods[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] show
promising results but with higher computational demands. Lightweight approaches such as SL-HarDNet
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and hybrid optimization frameworks have emerged to address efficiency and generalization,
especially in mobile or clinical settings.
      </p>
      <p>
        Visual Question Answering (VQA) is a complex task requiring joint understanding of images and
language, with Closed VQA framing it as a multi-class classification problem for consistent evaluation
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Early models used CNNs and LSTMs, but attention mechanisms like Bottom-Up and Top-Down
Attention improved performance [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Transformer-based models such as ViLBERT [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], MCAN [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ],
BAN [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and ViLT [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] enabled better cross-modal reasoning, building on Vaswani et al.’s architecture
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Despite progress, challenges remain in dense scenes and low-resource settings, addressed by models
like LXMERT and Oscar.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Segmentation</title>
        <p>
          We employ the original MSFNet architecture[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which integrates multi-scale feature extraction,
boundary refinement, and attention-driven fusion. The network utilizes a deep CNN backbone with five
convolutional blocks (Conv1–Conv5), capturing progressively abstract features from input images.
        </p>
        <p>Boundary Attention (BA) modules operate on intermediate layers (Conv2, Conv3) to emphasize
lesion edges. Parallel Partial Decoder (PPD) processes deep features from Conv4 and Conv5 to produce
a coarse semantic prediction. Reverse Attention (RA) modules then iteratively refine this prediction by
re-focusing on uncertain boundary regions.</p>
        <p>Finally, the outputs of BA, RA, and PPD modules are fused using element-wise addition and
convolution layers to generate the final lesion mask. A sigmoid activation produces a binary probability map
for segmentation.</p>
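        <p>As a rough illustration only (the actual network uses learned convolution layers after the addition), the fusion stage above can be sketched with NumPy arrays standing in for the BA, RA, and PPD output maps; the map sizes and logit values below are hypothetical:</p>

```python
import numpy as np

def sigmoid(x):
    # Logistic function, mapping fused logits to probabilities in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def fuse_outputs(ba_map, ra_map, ppd_map, threshold=0.5):
    """Element-wise fusion of boundary-attention (BA), reverse-attention (RA),
    and parallel-partial-decoder (PPD) logit maps, followed by a sigmoid to
    obtain a probability map and a threshold to obtain the binary lesion mask.
    The learned 1x1 convolution that would normally follow the addition is
    omitted in this sketch."""
    fused = ba_map + ra_map + ppd_map            # element-wise addition
    prob = sigmoid(fused)                        # probability map
    return (prob >= threshold).astype(np.uint8)  # binary lesion mask

# Toy 4x4 logit maps: strongly positive logits become foreground pixels.
ba = np.full((4, 4), -2.0)
ba[1:3, 1:3] = 2.0
ra = np.zeros((4, 4))
ra[1:3, 1:3] = 1.0
ppd = np.zeros((4, 4))
mask = fuse_outputs(ba, ra, ppd)
```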
        <p>
          As this architecture is reused without structural modification, detailed formulations are omitted here
and can be found in[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Closed VQA</title>
        <p>
          The proposed VQA framework for the ImageCLEF 2025 challenge [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] employs a multimodal architecture
that integrates clinical text queries and medical images. Clinical questions are encoded using a
BERT-based model, while images from the DermaVQA-DAS dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] undergo dual-path feature extraction
via Vision Transformer (ViT) and Local Binary Patterns (LBP). The resulting visual and textual features
are fused through outer product-like matching to capture fine-grained cross-modal associations. This
interaction enables accurate answer retrieval through similarity-based matching.
        </p>
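        <p>A minimal sketch of the outer-product-style fusion and similarity-based matching described above; the tiny embedding dimensions and the candidate-answer matrix are illustrative assumptions (the actual system uses 512-d BERT text features and ViT+LBP visual features):</p>

```python
import numpy as np

def l2_normalize(v):
    # Unit-normalize a vector so dot products become cosine similarities
    return v / (np.linalg.norm(v) + 1e-8)

def fuse_and_match(text_emb, image_emb, answer_embs):
    """Outer-product fusion of a text and an image embedding, flattened into a
    joint vector, then cosine-similarity matching against candidate answer
    embeddings. Returns the index of the best-matching closed answer."""
    joint = np.outer(text_emb, image_emb).ravel()  # fine-grained pairwise interactions
    joint = l2_normalize(joint)
    scores = answer_embs @ joint                   # rows are pre-normalized
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
text = l2_normalize(rng.normal(size=4))
image = l2_normalize(rng.normal(size=4))
# Candidate answers: the last one is deliberately aligned with the joint vector.
joint_true = l2_normalize(np.outer(text, image).ravel())
answers = np.stack([l2_normalize(rng.normal(size=16)) for _ in range(3)] + [joint_true])
best = fuse_and_match(text, image, answers)
```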
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Training and Evaluation Summary</title>
      <p>
        For segmentation, all training was conducted using the gold-standard masks provided in the
MEDIQA-MAGIC 2025 dataset [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], comprising 314 pixel-annotated dermatology images. To maintain consistency
and reduce computational load, both input images and corresponding ground truth masks were resized
to 352 × 352 pixels. The model was trained using a hybrid loss function designed to balance pixel-level
accuracy and region-level overlap:
ℒ_total = α · ℒ_WBCE + β · ℒ_IoU
(1)
      </p>
      <p>Here, ℒ_WBCE denotes the Weighted Binary Cross-Entropy Loss, which mitigates class imbalance by
assigning greater importance to underrepresented lesion pixels, while ℒ_IoU is the
Intersection-over-Union Loss that directly promotes region overlap between prediction and ground truth. The coefficients
were empirically set as α = 1 and β = 1 in all experiments.</p>
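      <p>The hybrid loss can be sketched as follows in NumPy; the positive-class weight inside the WBCE term is an illustrative assumption, as its value is not stated in the text:</p>

```python
import numpy as np

def weighted_bce(pred, target, pos_weight=2.0, eps=1e-7):
    """Weighted binary cross-entropy: lesion (positive) pixels receive a larger
    weight to counter class imbalance. pos_weight=2.0 is illustrative only."""
    pred = np.clip(pred, eps, 1 - eps)
    per_pixel = -(pos_weight * target * np.log(pred)
                  + (1 - target) * np.log(1 - pred))
    return per_pixel.mean()

def soft_iou_loss(pred, target, eps=1e-7):
    """Soft IoU loss on probability maps: 1 - intersection/union."""
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter
    return 1.0 - (inter + eps) / (union + eps)

def hybrid_loss(pred, target, alpha=1.0, beta=1.0):
    # L_total = alpha * L_WBCE + beta * L_IoU, with alpha = beta = 1 as in the text
    return alpha * weighted_bce(pred, target) + beta * soft_iou_loss(pred, target)

target = np.array([[1.0, 0.0], [1.0, 0.0]])
good = np.array([[0.9, 0.1], [0.9, 0.1]])   # close to the ground truth
bad = np.array([[0.1, 0.9], [0.1, 0.9]])    # inverted prediction
```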
      <p>To enhance generalization and prevent overfitting, standard data augmentation techniques such as
flipping, rotation, and contrast adjustment were applied during training. The Adam optimizer was
employed with an initial learning rate of 1 × 10⁻⁴. A learning rate decay strategy was incorporated,
reducing the learning rate by a factor of 0.1 if the validation loss did not improve for 10 consecutive
epochs. Training was conducted over a maximum of 100 epochs using a batch size of 8, with early
stopping enabled based on validation loss. During inference, predicted masks were upsampled to the
original image resolution using bilinear interpolation.</p>
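      <p>The plateau-based decay schedule described above (factor 0.1, patience 10) can be sketched in plain Python; this class is a simplified stand-in for illustration, not the actual training code:</p>

```python
class PlateauDecay:
    """Reduce the learning rate by `factor` when the validation loss has not
    improved for `patience` consecutive epochs."""
    def __init__(self, lr=1e-4, factor=0.1, patience=10):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss >= self.best:          # no improvement this epoch
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor     # decay and reset the counter
                self.bad_epochs = 0
        else:                              # validation loss improved
            self.best = val_loss
            self.bad_epochs = 0
        return self.lr

sched = PlateauDecay()
# One improving epoch followed by ten stagnant ones triggers a single decay.
for _ in range(11):
    lr = sched.step(1.0)
```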
      <p>For evaluation, segmentation performance was quantified using the Dice coefficient and mean
Intersection over Union (mIoU). The Dice score is computed as:
Dice = 2·TP / (2·TP + FP + FN)
(2)
where TP, FP, and FN represent the true positives, false positives, and false negatives, respectively.
The mIoU metric averages IoU across all N instances:
mIoU = (1/N) Σ_{i=1}^{N} TP_i / (TP_i + FP_i + FN_i)
(3)
      </p>
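      <p>Both metrics follow directly from the confusion counts; a small NumPy sketch for binary masks:</p>

```python
import numpy as np

def dice_and_iou(pred, gt):
    """Dice = 2*TP / (2*TP + FP + FN) and IoU = TP / (TP + FP + FN),
    computed from a binary prediction mask and a ground-truth mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    denom = tp + fp + fn
    # Empty prediction and empty ground truth count as a perfect match
    dice = 2 * tp / (2 * tp + fp + fn) if denom else 1.0
    iou = tp / denom if denom else 1.0
    return dice, iou

gt = np.array([[1, 1], [0, 0]])
pred = np.array([[1, 0], [0, 0]])
dice, iou = dice_and_iou(pred, gt)  # TP=1, FP=0, FN=1
```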
      <p>For Visual Question Answering (VQA), the model was trained in a supervised manner to classify inputs
into one of 12 answer categories. The input comprises a fused feature vector of dimension 1024, created
by concatenating a 512-dimensional text embedding (extracted using CLIP) and a 512-dimensional
visual embedding (obtained from a Vision Transformer and Local Binary Patterns).</p>
      <p>The classifier outputs logits over the answer space, which are converted into class probabilities using
the softmax function:
P(y = c | x) = exp(z_c) / Σ_{j=1}^{K} exp(z_j),  c ∈ {1, 2, . . . , K}
(4)
The model was trained to minimize the categorical cross-entropy loss:
ℒ(θ) = −(1/N) Σ_{i=1}^{N} log P(y_i | x_i)
(5)
where N is the number of training samples, y_i is the ground truth label, and P(y_i | x_i) is the predicted
probability for the true class. Optimization was performed using the AdamW optimizer with a fixed
learning rate of 1 × 10⁻⁴, and training was run for 1000 epochs with a batch size of 32. Mixed precision
training on CUDA-enabled GPUs was used to accelerate computation and reduce memory usage.</p>
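      <p>The classification head can be sketched in NumPy as follows; the classifier weights below are random placeholders standing in for the trained parameters:</p>

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability before exponentiating
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels, eps=1e-12):
    # Mean negative log-probability of the true class over N samples
    n = labels.shape[0]
    return -np.log(probs[np.arange(n), labels] + eps).mean()

rng = np.random.default_rng(0)
fused = rng.normal(size=(4, 1024))           # 512-d text + 512-d visual, concatenated
W = rng.normal(scale=0.01, size=(1024, 12))  # placeholder weights, 12 answer classes
probs = softmax(fused @ W)
loss = cross_entropy(probs, np.array([0, 3, 7, 11]))
```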
      <p>Evaluation was conducted according to the ImageCLEF VQA-Med 2024 protocol. The performance
was measured using macro-averaged accuracy across grouped question types. Let ℰ represent the set of
encounter IDs and 𝒬 the set of grouped question IDs. For each encounter-question pair (e, q), with
gold-standard answers G_{e,q} and predictions P_{e,q}, instance-level accuracy is defined as:
Accuracy(e, q) = |G_{e,q} ∩ P_{e,q}| / max(|G_{e,q}|, |P_{e,q}|)
(6)
Group-level accuracy averages instance accuracy over encounters:
Accuracy_q = (1/|ℰ|) Σ_{e∈ℰ} Accuracy(e, q)
(7)
And the final macro-averaged accuracy across all question types is given by:
Accuracy_overall = (1/|𝒬|) Σ_{q∈𝒬} Accuracy_q
(8)
      </p>
      <p>The evaluation process involved parsing both gold and predicted JSON files, grouping responses
by question and encounter ID, and computing the aforementioned accuracy metrics. Predictions with
missing instances were assigned an accuracy of zero to ensure consistency and fairness across all model
submissions.</p>
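      <p>The grouped-accuracy protocol can be sketched in plain Python; the dictionary layout (question ID → encounter ID → answer list) is an assumption for illustration, not the official evaluation parser:</p>

```python
def instance_accuracy(gold, pred):
    """Instance-level accuracy: |gold ∩ pred| / max(|gold|, |pred|)."""
    if not gold and not pred:
        return 1.0
    return len(set(gold).intersection(set(pred))) / max(len(gold), len(pred))

def macro_accuracy(gold_by_q, pred_by_q):
    """Average instance accuracy per question group, then average over groups.
    Missing predictions score zero, mirroring the protocol described above."""
    per_group = []
    for q, encounters in gold_by_q.items():
        accs = [instance_accuracy(ans, pred_by_q.get(q, {}).get(e, []))
                for e, ans in encounters.items()]
        per_group.append(sum(accs) / len(accs))
    return sum(per_group) / len(per_group)

# Toy example: one question group, one correct and one wrong encounter.
gold = {"CQID012": {"e1": ["yes"], "e2": ["no"]}}
pred = {"CQID012": {"e1": ["yes"], "e2": ["yes"]}}
acc = macro_accuracy(gold, pred)
```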
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>
        We evaluated the segmentation model on the MEDIQA-MAGIC 2025 DermaVQA-DAS dataset [
        <xref ref-type="bibr" rid="ref19 ref6">6, 19</xref>
        ],
comprising 314 annotated dermatology images. Performance was measured using Dice and Jaccard
indices, including both mean-of-mean and mean-of-maximum variants. The model achieved a mean Dice
coefficient of 0.7021 and a mean Jaccard index of 0.5410, indicating strong lesion overlap accuracy.
Best-case alignment was reflected by Dice (mean of max) at 0.7512 and Jaccard (mean of max) at 0.6377, while
Dice (mean of mean) and Jaccard (mean of mean) were 0.6711 and 0.5538, respectively, demonstrating
stable performance across the dataset. A visual example of segmentation outputs, including original
images, ground truth, and model predictions, is shown in Figure 3.
      </p>
      <p>We also evaluated our closed-ended Visual Question Answering (VQA) model on a test dataset
containing 56 image-question pairs as part of the ImageCLEF VQA-Med 2024 task. Submitted under
the team name KLE1 (Rank 12), the model was assessed based on per-question-type accuracy and
overall performance. As shown in Table 2, the model achieved an overall accuracy of 56.98%. It
performed particularly well in CQID012 (74.80%) and CQID035 (74.00%), indicating strength in those
question categories. However, it showed lower performance in CQID034 (39.00%) and CQID036 (35.00%),
suggesting scope for improvement in those areas. These results confirm the model’s effectiveness in
handling diverse closed-form visual questions while identifying areas for future refinement.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>Our MSFNet-based segmentation framework demonstrated strong performance on the MEDIQA-MAGIC
2025 dataset, achieving a mean Dice coefficient of 0.7021 and Jaccard index of 0.5410 across 314 skin
lesion instances. The hybrid loss formulation, combining Weighted Binary Cross-Entropy and IoU
losses, was effective in handling class imbalance and enhancing boundary delineation. These results
highlight the model’s robustness and generalization capability in binary lesion segmentation. For
future enhancements, we aim to extend the framework to multi-class segmentation, integrate advanced
attention modules, explore alternative loss strategies, and optimize the model for real-time clinical or
mobile deployment.</p>
      <p>In parallel, our closed-ended Visual Question Answering (VQA) model attained an overall accuracy of
56.98% on a 56-pair test set, performing notably well in specific categories such as CQID012 and CQID035.
This reflects the model’s potential in accurately interpreting visual inputs and providing consistent
answers to domain-specific questions. However, lower performance in certain categories also reveals
areas for improvement. Future work will focus on expanding training data diversity, incorporating
larger-scale multimodal datasets, and experimenting with transformer-based and attention-driven
fusion architectures. These efforts aim to boost model generalization and accuracy across broader
medical VQA tasks.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, we used a generative AI tool (ChatGPT) for tasks such as
grammar checking and paraphrasing. All AI-generated content was reviewed and edited by the authors,
who take full responsibility for the final version of the manuscript.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Esteva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kuprel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Swetter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Blau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thrun</surname>
          </string-name>
          ,
          <article-title>Dermatologist-level classification of skin cancer with deep neural networks</article-title>
          ,
          <source>nature</source>
          <volume>542</volume>
          (
          <year>2017</year>
          )
          <fpage>115</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          , U-net:
          <article-title>Convolutional networks for biomedical image segmentation, in: Medical image computing and computer-assisted intervention-MICCAI 2015: 18th international conference</article-title>
          , Munich, Germany, October 5-
          <issue>9</issue>
          ,
          <year>2015</year>
          , proceedings,
          <source>part III 18</source>
          , Springer,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Papandreou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <article-title>Encoder-decoder with atrous separable convolution for semantic image segmentation</article-title>
          ,
          <source>in: Proceedings of the European conference on computer vision (ECCV)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>801</fpage>
          -
          <lpage>818</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Oktay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schlemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Folgoc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Misawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McDonagh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Y.</given-names>
            <surname>Hammerla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kainz</surname>
          </string-name>
          , et al., Attention u-net:
          <article-title>Learning where to look for the pancreas</article-title>
          , arXiv preprint arXiv:1804.03999 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Basak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kundu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <article-title>Mfsnet: A multi focus segmentation network for skin lesion segmentation</article-title>
          ,
          <source>Pattern Recognition</source>
          <volume>128</volume>
          (
          <year>2022</year>
          )
          <fpage>108673</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <article-title>Overview of the mediqa-magic task at imageclef 2025: Multimodal and generative telemedicine in dermatology</article-title>
          ,
          <source>in: CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Antol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          , Vqa:
          <article-title>Visual question answering</article-title>
          , in: ICCV,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Medical visual question answering: A survey</article-title>
          ,
          <source>Medical Image Analysis</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <article-title>Microsoft coco captions: Data collection and evaluation server</article-title>
          ,
          <source>in: arXiv preprint arXiv:1504.00325</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Faghri</surname>
          </string-name>
          , I. Vulić,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <article-title>Supervised multimodal bitransformers for classifying images and text</article-title>
          ,
          <source>EMNLP</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          , arXiv preprint arXiv:2010.11929 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. M. J.</given-names>
            <surname>Valanarasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Oza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Hacihaliloglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <article-title>Medical transformer: Gated axial-attention for medical image segmentation</article-title>
          , in:
          <source>Medical Image Computing and Computer Assisted Intervention - MICCAI 2021: 24th International Conference, Strasbourg, France, September 27-October 1, 2021, Proceedings, Part I</source>
          , Springer,
          <year>2021</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Innani</surname>
          </string-name>
          , et al.,
          <article-title>Efficient-gan: Adversarial learning framework with morphology-aware loss for skin lesion segmentation</article-title>
          ,
          <source>arXiv preprint arXiv:2305.18164</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chao</surname>
          </string-name>
          , et al.,
          <article-title>Sl-hardnet: A lightweight network for skin lesion segmentation with boundary enhancement</article-title>
          ,
          <source>Frontiers in Bioengineering and Biotechnology</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>1028690</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks</article-title>
          , in:
          <source>NeurIPS</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>Deep modular co-attention networks for visual question answering</article-title>
          , in:
          <source>CVPR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Bilinear attention networks</article-title>
          , in:
          <source>NeurIPS</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Natarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <article-title>Towards vqa models that can read</article-title>
          ,
          <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          (
          <year>2019</year>
          )
          <fpage>8317</fpage>
          -
          <lpage>8326</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <article-title>Dermavqa-das: Dermatology assessment schema (das) and datasets for closed-ended question answering and segmentation in patient-generated dermatology images</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>