<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Kasukabe Defense Group at MEDIQA-MAGIC 2025: Clinical Visual Question Answering with Resource-Efficient Multi-modal Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Khushi Bahadur Desai</string-name>
          <email>khushibdsai@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Varunkumar S Hiregoudar</string-name>
          <email>uma@kletech.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ishan G Kulkarni</string-name>
          <email>ikishankulkarni16@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ratan Dhane</string-name>
          <email>ratandhane748@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Padmashree Desai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sujatha C</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Uma Mudenagudi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ramesh Ashok Tabib</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KLE Technological University</institution>
          ,
          <addr-line>Hubli</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Dermatology is among the most visually demanding specialties of clinical medicine, where correct diagnosis depends heavily on evaluating skin lesion appearance, texture, and site. With the increasing uptake of telemedicine, the demand for robust, automatic systems capable of interpreting dermatoscopic images alongside clinical queries continues to grow. To this end, we introduce an efficient and competitive multimodal model for the closed-ended Visual Question Answering task in dermatology. Our model combines DistilBERT for textual features and EfficientNet-B0 for visual features, balancing performance with computational cost. With a late fusion approach augmented by QID-specific classification heads, the model efficiently accommodates the variability of dermatological queries in the dataset. We make large-scale comparisons with larger models such as ClinicalBERT, RoBERTa, Swin Transformers, and prompt-engineered counterparts. Although the top-performing configuration reaches 0.54 macro-F1, our DistilBERT-EfficientNet-B0 baseline achieves competitive accuracy with far fewer parameters and faster inference. These results highlight the utility of efficient encoders and modular design for real-world clinical decision support. Our result on the MEDIQA-MAGIC 2025 dataset demonstrates the efficacy of resource-limited models for large-scale, multimodal applications in healthcare.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Dermatology is one of the most visually focused areas in clinical medicine, where making an accurate
diagnosis often depends on carefully observing the appearance, texture, and location of skin lesions.
In recent years, telemedicine has made it easier for patients to get dermatological care remotely by
sharing images and descriptions of their skin conditions. However, this shift also brings new challenges:
Automated systems now need to understand and analyze both the images and the accompanying clinical
questions to provide meaningful answers.</p>
      <p>Medical Visual Question Answering (VQA), especially in dermatology, is an exciting and growing
research area aimed at tackling these challenges. The goal is to build models that can simultaneously
interpret images and clinical questions to support diagnosis or patient care. While powerful architectures
like RoBERTa and Swin Transformer have shown strong abilities to capture semantic and visual details,
they often overfit when trained on smaller datasets, struggle with uneven question types, and demand
heavy computational resources.</p>
      <p>
        To address these issues, we propose a lightweight but effective multi-modal model designed specifically
for closed-ended VQA tasks in dermatology. Our approach uses DistilBERT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to process the text and
EfficientNet-B0 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to extract visual features. This combination helps keep the model fast and efficient
without compromising too much accuracy. Further, with its dynamic, question-specific classification
heads and late fusion strategy, our system is able to accommodate a large number of relevant clinical
questions without entailing a significant increase in complexity.
      </p>
      <p>
        We have extensively evaluated our model on the MEDIQA-MAGIC 2025 DermaVQA-DAS database [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
and the results indicate that it achieves a very good trade-off between performance and computational
cost. It performs similarly to larger, more complex models and is better suited to real-world
clinical settings where computing power is limited and data quality is inconsistent.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The survey Medical Visual Question Answering: A Survey by Lin et al. reviews methods from a
large number of published papers in medical VQA [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Their study identifies central elements, such as
model architecture and feature fusion method. From the many methods, the joint embedding framework
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] was found to be the most extensively used [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. This architecture combines independent encoders
for image and question input, a feature fusion mechanism, and an answering module specific to the
task, e.g., Multiple Choice Questions (MCQs) or open-ended questions. Image encoders are usually built on
top of proven Convolutional Neural Network (CNN) backbones such as ResNet [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and VGG Net [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
whereas question encoders tend to use language models like Transformers [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] or LSTM [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Such
encoders are typically initialized with pre-trained weights and can be frozen or fine-tuned end-to-end when training
VQA models. The answering module is usually realized as either a neural-network classifier for MCQs
or a generative model for open-ended questions. A popular fusion approach is concatenation, which
directly combines question and image features; more recent methods use attention mechanisms
to enhance feature fusion. In general, although architectural diversity in the field is quite limited,
joint embedding models—generally utilizing VGG Net and EfficientNet-B0 for image encoding and
different models like LSTM, Bi-LSTM [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], GRU [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and transformer-based encoders such as BERT
and DistilBERT for text—predominate [
        <xref ref-type="bibr" rid="ref15">15, 16, 17</xref>
        ]. However, a lightweight natural-language encoder such as an
LSTM can be a good fit for optimizing efficiency and resources in medical VQA, since the task generally
involves a small set of question types [
        <xref ref-type="bibr" rid="ref18 ref19 ref16 ref8">18, 19, 16, 8</xref>
        ]. To handle diverse question formats—MCQs with varying numbers of options, open-ended questions,
and others—multiple classifier heads that are question-type specific have been proposed. Each head is tasked with predicting the response to a specific
kind of question so that the model can better capture the semantics of each question type.
      </p>
      <p>
        In addition to subtask strategies, other techniques have also been investigated. Global Average
Pooling (GAP) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], for instance, substitutes standard fully connected layers with averaged feature maps,
resulting in better generalization and less overfitting. Other developments include embedding-based topic
modeling to incorporate latent question semantics, Question-Conditioned Reasoning for input
query-guided dynamic decision-making, and image size encoders to account for the spatial
properties of medical images. Together, these methods reflect the increasing trend of incorporating
domain-sensitive and task-sensitive mechanisms in medical VQA systems.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Methodology</title>
      <p>Our framework is based on a modular late-fusion model designed to reason jointly over dermatological
images and associated clinical questions for closed-ended VQA. Each model input is a dermatoscopic
image paired with a clinical question identified by a Question Identifier (QID). The model aims to predict the
correct answer index for each QID-image pair.</p>
      <p>To encode the text input, we use DistilBERT, a distilled variant of BERT that achieves an optimal
trade-off between representational capacity and computational cost. Every clinical question goes
through tokenization and then through DistilBERT. The representation of the [CLS] token is taken and
fed through a linear layer to generate a 256-dimensional embedding capturing the semantic content of
the query.</p>
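      <p>As an illustration of the text branch, the following sketch (in PyTorch with the Hugging Face transformers library) shows how the [CLS] embedding from DistilBERT can be projected to 256 dimensions; the class and variable names are illustrative assumptions rather than the exact implementation used in our experiments.</p>
      <preformat>
import torch
import torch.nn as nn
from transformers import DistilBertModel, DistilBertTokenizerFast

class QuestionEncoder(nn.Module):
    """Encodes a clinical question into a 256-dimensional embedding."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.proj = nn.Linear(self.bert.config.dim, out_dim)  # 768 -> 256

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        cls = hidden[:, 0]        # representation of the [CLS] token
        return self.proj(cls)     # (batch, 256)

# Example usage on one question from the dataset schema
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
batch = tokenizer(["How much of the body is affected?"], padding=True, return_tensors="pt")
text_emb = QuestionEncoder()(batch["input_ids"], batch["attention_mask"])  # shape (1, 256)
      </preformat>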
      <p>On the visual front, we use EfficientNet-B0—a compact convolutional neural network with an excellent
performance-to-parameter ratio. We resize each input image to 224 × 224 pixels and normalize it before
passing it through the EfficientNet backbone. The final convolutional features are pooled with global
average pooling to produce a 1280-dimensional image vector, which is subsequently projected to
256 dimensions by a linear layer.</p>
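      <p>The image branch can be sketched in the same spirit: EfficientNet-B0 features are pooled to a 1280-dimensional vector and projected to 256 dimensions. The torchvision weights enum below assumes a recent torchvision release; names are illustrative.</p>
      <preformat>
import torch
import torch.nn as nn
from torchvision import models, transforms

class ImageEncoder(nn.Module):
    """Encodes a dermatoscopic image into a 256-dimensional embedding."""
    def __init__(self, out_dim: int = 256):
        super().__init__()
        backbone = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
        self.features = backbone.features      # convolutional trunk
        self.pool = nn.AdaptiveAvgPool2d(1)    # global average pooling
        self.proj = nn.Linear(1280, out_dim)   # 1280 -> 256

    def forward(self, images):                 # images: (batch, 3, 224, 224)
        x = self.pool(self.features(images)).flatten(1)   # (batch, 1280)
        return self.proj(x)                                # (batch, 256)

# Preprocessing matching the 224 x 224 resize and ImageNet normalization described above
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
      </preformat>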
      <p>The projected image and text embeddings are concatenated to produce a 512-dimensional
multimodal representation. To address the challenge’s requirement that multiple different question types be
answered, we add a QID-specific set of classification heads. Each head predicts a probability distribution
over available answers for a question, allowing the model to tailor its predictions to each question
type’s semantics.</p>
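      <p>The late-fusion step and the QID-specific heads can be summarized by the following sketch; the mapping from QIDs to the number of answer options (answer_counts) is an assumption that would in practice be read from the question-definition file.</p>
      <preformat>
import torch
import torch.nn as nn

class LateFusionVQA(nn.Module):
    """Concatenates 256-d text and image embeddings and routes them to a per-QID head."""
    def __init__(self, text_encoder, image_encoder, answer_counts):
        super().__init__()
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder
        # one classifier head per question identifier (QID)
        self.heads = nn.ModuleDict({
            qid: nn.Linear(512, n_answers) for qid, n_answers in answer_counts.items()
        })

    def forward(self, input_ids, attention_mask, images, qid):
        t = self.text_encoder(input_ids, attention_mask)   # (batch, 256)
        v = self.image_encoder(images)                      # (batch, 256)
        fused = torch.cat([t, v], dim=1)                     # (batch, 512)
        return self.heads[qid](fused)                        # logits over that QID's answer options
      </preformat>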
      <p>The model is trained end-to-end with categorical cross-entropy loss, with the correct classification
head chosen dynamically depending on the QID of each sample. We train the network with the AdamW
optimizer and a learning rate of 2 × 10<sup>−5</sup>, and measure performance with the macro-F1 score to counteract
class imbalance. Training continues for up to 25 epochs, with early stopping triggered by stagnation
of the validation F1 score. At inference, each test image is combined with all valid QIDs, and predictions
are produced with the corresponding classifier heads. The final predictions are formatted in the
appropriate JSON format for submission. The current implementation predicts a single best answer for
each question; extending the model to support multiple-answer predictions is left for
future work.</p>
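      <p>A minimal sketch of the training and validation loop under the recipe above (AdamW at 2 × 10<sup>−5</sup>, cross-entropy on the head selected by each sample’s QID, macro-F1 for early stopping) is given below; the data-loader field names and per-QID batching are assumptions, not the exact implementation.</p>
      <preformat>
import torch
import torch.nn.functional as F
from sklearn.metrics import f1_score

def run_epoch(model, loader, optimizer=None):
    """One pass over `loader`; updates the model only when an optimizer is given."""
    model.train(optimizer is not None)
    preds, golds = [], []
    with torch.set_grad_enabled(optimizer is not None):
        for batch in loader:                      # each batch is assumed to hold a single QID
            logits = model(batch["input_ids"], batch["attention_mask"],
                           batch["images"], batch["qid"])
            loss = F.cross_entropy(logits, batch["answer_idx"])
            if optimizer is not None:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            preds += logits.argmax(dim=1).tolist()
            golds += batch["answer_idx"].tolist()
    return f1_score(golds, preds, average="macro")

# Outer loop with early stopping on the validation macro-F1 (sketch)
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# for epoch in range(25):
#     run_epoch(model, train_loader, optimizer)
#     val_f1 = run_epoch(model, val_loader)   # stop when val_f1 stops improving
      </preformat>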
      <sec id="sec-3-1">
        <title>3.1. Architecture</title>
        <p>M1 – Baseline Multi-modal Model: M1 uses the basic architecture outlined above, with DistilBERT
as the text encoder and EfficientNet-B0 as the image encoder. This model serves as the basis of our
solution and sets a solid baseline that is both efficient and effective. It has a shared classification head
for all question types.</p>
        <p>M2 – Clinical Text and Larger Image Encoder: In M2, we replace DistilBERT with ClinicalBERT,
a language model trained on clinical text, which increases the model’s capacity to understand domain-specific
jargon. We also upgrade the visual encoder to EfficientNet-B3, a deeper variant that can detect
more subtle visual features. The model retains late fusion and a shared classifier head.</p>
        <p>M3 – RoBERTa with Prompt Engineering: To enhance text understanding, M3 substitutes the text
encoder with RoBERTa-base, which is renowned for its excellent contextual insight. We incorporate
prompt engineering methods by rephrasing and augmenting the input queries in a manner that enables
the model to read more pertinent linguistic signals. The image encoder is still EfficientNet-B3, while
predictions are derived via a shared classification head.</p>
        <p>M4 – Domain-Aware Transformers: Model M4 uses Bio_ClinicalBERT as its text encoder, a model
that has been pre-trained on biomedical corpora. It uses the Swin Transformer for processing images,
which is robust at learning spatial relationships with its hierarchical attention mechanism. The model
continues to use the late-fusion scheme and shared head but gets improved domain-specific feature
extraction in both modalities.</p>
        <p>M5 – Question-Specific Decoding with Weighted Loss: M5 keeps the backbone encoders of
M3 (RoBERTa-base and EfficientNet-B3) but adds a collection of specialized classifier heads, one for
each QID, to facilitate fine-grained reasoning across question types. To compensate for skewed label
distributions, the training loss uses class-weighted cross-entropy to pay closer attention
to minority classes.</p>
        <p>M6 – Prompt-Augmented and Fine-Tuned Model: Extending M5, this model appends
contextual cues such as patient history directly to the input text in order to enhance semantic
awareness. In addition, the entire model is fine-tuned end-to-end on the VQA dataset. Together, these changes
make M6 the most elaborate of all the tested configurations.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Training and Validation</title>
      <p>The proposed framework presents a domain-adapted, efficient, and explainable multi-modal system
for closed-ended VQA in dermatology as part of the MEDIQA-MAGIC 2025 competition. The problem
consists of interpreting dermoscopic images and answering structured clinical queries based on visual
and textual information. This is motivated by the operational limitations common to real-world
deployments, including computational resource constraints, unbalanced question distributions, and
patient-generated imagery variations.</p>
      <p>Figure 1 depicts the end-to-end workflow of our model during both the training and inference steps.
Architecturally, we use a late fusion approach, where image and text features are encoded separately
and subsequently combined to create a single multi-modal representation employed for classification.</p>
      <p>For the text modality, we use DistilBERT, a distilled BERT that preserves much of the linguistic ability
of BERT but is faster and less resource-hungry. Clinical queries are tokenized and fed into DistilBERT,
from which the [CLS] token embedding is taken as a semantic summary. This embedding is linearly
projected into a 256-dimensional latent space.</p>
      <p>From the visual perspective, we utilize EfficientNet-B0, a small but efficient CNN pretrained
on ImageNet. Dermatoscopic images are normalized and resized and then fed into the encoder. Deep
visual features are obtained with GAP, yielding a 1280-dimensional vector, which is then
projected to 256 dimensions by another projection layer in order to be compatible with the features of
text.</p>
      <p>The two 256-dimensional vectors are concatenated into a 512-dimensional joint representation.
To address the variability of clinical questions—each of them defined by a QID—we create a modular
classifier head mechanism. A different classifier is attached to each QID, enabling the model to specialize
and generate predictions sensitive to the semantics of each question type. This architecture facilitates
multi-task learning and enhances robustness over heterogeneous clinical attributes, like lesion size,
color, texture, and anatomical location.</p>
      <p>We train the model end-to-end with categorical cross-entropy loss. The correct head for each instance
is dynamically chosen based on the corresponding QID. We address class imbalance by measuring
model performance with the macro-F1 score and applying early stopping based on validation F1 gains. We optimize
the model with the AdamW optimizer at a learning rate of 2 × 10<sup>−5</sup>, and train for 25 epochs, keeping
track of the best-performing checkpoint.</p>
      <p>At inference, the model processes each test image by combining it with all relevant questions.
Prediction is done based on the respective QID-specific heads, and the output is a structured JSON file
that maps every encounter_id and QID to its predicted answer index, following the MEDIQA-MAGIC
submission format.</p>
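      <p>A hedged sketch of this inference step is shown below: each test image is paired with every QID, the matching head produces a prediction, and the results are serialized to JSON. The field names and the exact JSON layout here are assumptions; the authoritative format is the one defined by the challenge organizers.</p>
      <preformat>
import json
import torch

@torch.no_grad()
def predict_submission(model, test_samples, tokenizer, question_text, out_path="predictions.json"):
    """question_text maps each QID to its question string; test_samples yield encounter_id and image."""
    model.eval()
    results = {}
    for sample in test_samples:
        enc_id, image = sample["encounter_id"], sample["image"].unsqueeze(0)
        results[enc_id] = {}
        for qid, question in question_text.items():
            tokens = tokenizer(question, return_tensors="pt")
            logits = model(tokens["input_ids"], tokens["attention_mask"], image, qid)
            results[enc_id][qid] = int(logits.argmax(dim=1).item())
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
      </preformat>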
      <p>In short, the suggested framework is computationally efficient while retaining predictive accuracy, making it suitable
for deployment in clinical or mobile settings. Its extensibility and modularity ensure that it can
be extended to handle new question types and evolving dermatology applications with ease.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>Understanding the capabilities and limitations of our model requires a well-defined evaluation set-up
grounded in representative datasets. In this section, we first describe the official DermaVQA-DAS dataset
prepared for the challenge, followed by an analysis of the model’s performance against the challenge-specific
evaluation metric.</p>
      <sec id="sec-5-1">
        <title>5.1. Dataset Description</title>
        <p>
          In this study, we use the dataset provided as part of the ImageCLEF 2025 MEDIQA-MAGIC challenge
[
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], specifically designed for the closed-ended VQA task in dermatology. Figure 2 displays samples
from the DermaVQA-DAS dataset [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This dataset aims to assess the ability of multi-modal models
to interpret and reason over both clinical images and related structured questions about skin
conditions.
        </p>
        <p>The image corpus is made up of high-resolution dermatological images gathered from real-world
clinical settings, as well as patient-generated submissions. Each image corresponds to a unique encounter,
identified by an encounter_id embedded in the filename (e.g., IMG_ENC00001_00001.jpg). A wide range
of dermatological issues is captured across different body sites such as the back, abdomen, palms, and feet.
The dataset reflects real-world conditions in lighting, skin tones, and lesion types — including rashes,
pigmentations, bumps, and more—making it well-suited for training robust models.</p>
        <p>Accompanying each image is a set of closed-ended questions. The questions follow a consistent
schema, and each one is linked to a unique identifier QID. For example, a question like “How much
of the body is affected?” may offer fixed response options such as “single spot,” “limited area,” or
“widespread.” These structured options enable consistent supervision across training samples. The
definitions of all questions and their answer choices are provided in a dedicated JSON file
(closedquestions_definitions_imageclef2025.json), which our model parses dynamically during both training and
inference. Each image is associated with multiple questions, which allows for a multi-task learning
setup where the model must produce separate predictions for different aspects of the same case.</p>
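        <p>A small sketch of how the question definitions can be parsed into lookup tables used to size the classifier heads is shown below; the JSON layout assumed here (a list of records with a qid, question text, and answer options) is an illustrative assumption, and only the filename comes from the dataset release.</p>
        <preformat>
import json

# Assumed layout: [{"qid": "...", "question_en": "...", "options": ["...", ...]}, ...]
with open("closedquestions_definitions_imageclef2025.json") as f:
    definitions = json.load(f)

question_text = {}   # QID -> question string
answer_counts = {}   # QID -> number of answer options (sizes each classifier head)
for entry in definitions:
    qid = entry["qid"]
    question_text[qid] = entry["question_en"]
    answer_counts[qid] = len(entry["options"])
        </preformat>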
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Evaluation Metrics</title>
        <p>To evaluate how our system performs on the dermatology VQA task, we used the official evaluation script
provided by the MEDIQA-MAGIC 2025 challenge organizers. This script is specifically designed to
evaluate models that generate closed-ended clinical responses to image-related questions in dermatology.
Each question is associated with a unique identifier (e.g., CQID010-001), allowing for easy tracking
and performance analysis across different question categories.</p>
        <p>Unlike traditional classification metrics, this evaluation employs a Jaccard Index-based accuracy
measure—commonly known as Intersection over Union (IoU) accuracy. This metric is particularly
advantageous in cases where multiple correct answers may exist, as it provides partial credit for
overlapping predictions. The IoU-based accuracy for a single prediction is defined in Equation (1):</p>
        <p>Accuracy(P, G) = |P ∩ G| / max(|P|, |G|)  (1)</p>
        <p>where P is the set of predicted labels and G is the set of true labels. For most single-label questions,
this metric simplifies to a binary match.</p>
        <p>To compute the overall performance score, the script calculates the average IoU-based accuracy
across all question-image pairs, as shown in Equation (2):</p>
        <p>Total Accuracy = (1/N) Σ<sub>i=1</sub><sup>N</sup> |P<sub>i</sub> ∩ G<sub>i</sub>| / max(|P<sub>i</sub>|, |G<sub>i</sub>|)  (2)</p>
        <p>where N is the total number of evaluated samples.</p>
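        <p>For clarity, a minimal re-implementation sketch of Equations (1) and (2) is given below; the organizers’ official script remains the authoritative reference.</p>
        <preformat>
def iou_accuracy(predicted, gold):
    """|P ∩ G| / max(|P|, |G|); reduces to an exact match for single-label questions."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    return len(predicted.intersection(gold)) / max(len(predicted), len(gold))

def total_accuracy(pairs):
    """Average IoU-based accuracy over an iterable of (predicted_labels, gold_labels) pairs."""
    pairs = list(pairs)
    return sum(iou_accuracy(p, g) for p, g in pairs) / len(pairs)

# Example: a correct single-label answer scores 1.0, a wrong one scores 0.0
# total_accuracy([({"limited area"}, {"limited area"}), ({"single spot"}, {"widespread"})]) == 0.5
        </preformat>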
        <p>In addition to the overall accuracy, the evaluation tool provides a detailed breakdown by question
category—such as lesion size, affected body area, or skin texture. This enables a more fine-grained
analysis of model performance across clinically relevant subgroups.</p>
        <p>By leveraging this IoU-based metric, the evaluation framework aligns more closely with real-world
clinical practice, where ambiguity and partial correctness are common. It offers a more nuanced and
forgiving measure of model effectiveness, as opposed to strict all-or-nothing correctness.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Model comparison</title>
        <p>Model M2, which uses ClinicalBERT alongside a deeper image encoder (EfficientNet-B3), reached
similar accuracy but required more computational resources. Although it was more complex, the
improvements over M1 were marginal. Models M3 through M6 introduced additional strategies—such
as prompt engineering, advanced attention mechanisms, and full fine-tuning—but these did not lead to
consistently higher scores.</p>
        <p>Interestingly, the most sophisticated setup, Model M6, turned out to be the weakest performer overall.
This outcome suggests that adding complexity does not always translate into better results, especially
when data availability is limited or question types are unevenly distributed.</p>
        <p>To validate our findings, we submitted the top three models for official evaluation on the test set.
The results are presented in Table 2, which reflects the final macro-F1 scores provided by the
MEDIQA-MAGIC 2025 organizers. Model M1 again achieved the strongest performance, followed closely by M4.
These results reinforce the idea that simpler, well-balanced designs can be more reliable in real-world
applications than more elaborate alternatives.</p>
        <p>Overall, these findings indicate that M1’s streamlined approach offers a strong balance of performance,
speed, and ease of use. In clinical scenarios where reliability and efficiency are priorities, a clear and
focused architecture like M1 can often be the most effective solution.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>We have proposed a lightweight yet effective multi-modal model specifically designed for the
closed-ended visual question-answering task in dermatology. By integrating DistilBERT for processing
clinical questions with EfficientNet-B0 for dermatoscopic image analysis, our system strikes a robust
balance between accuracy and computation time.</p>
      <p>Our testing on the MEDIQA-MAGIC 2025 dataset indicated that this reduced design is as effective as
more elaborate, resource-intensive models. Due to its simplicity and resilience, the model is especially
conducive to real-world deployment—particularly in clinical settings where computing resources might
be restricted and image quality can be inconsistent. Its modular architecture, combined with
specific classifier heads per question type, allows for flexible adaptation to virtually any dermatologic query.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <sec id="sec-7-1">
        <title>Either:</title>
        <p>The author(s) have not employed any Generative AI tools.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Or (by using the activity taxonomy in ceur-ws.org/genai-tax.html):</title>
        <p>During the preparation of this work, the author(s) used X-GPT-4 and Gramby in order to: Grammar and
spelling check. Further, the author(s) used X-AI-IMG for figure 1 in order to: Generate images. After
using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full
responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/1910.01108. arXiv:1910.01108.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>EfficientNet: Rethinking model scaling for convolutional neural networks</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/1905.11946. arXiv:1905.11946.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <article-title>Dermavqa-das: Dermatology assessment schema (das) and datasets for closed-ended question answering and segmentation in patient-generated dermatology images</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hafari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <article-title>Medical visual question answering: A survey</article-title>
          ,
          <source>Artificial Intelligence in Medicine</source>
          <volume>143</volume>
          (
          <year>2023</year>
          )
          <fpage>102611</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Antol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          , Vqa:
          <article-title>Visual question answering</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>2425</fpage>
          -
          <lpage>2433</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Huang</surname>
          </string-name>
          , G. Chen,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Sysu-hcp at vqa-med 2021: A data-centric model with efficient training methodology for medical visual question answering</article-title>
          ,
          <source>in: Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2021</year>
          . URL: https://api.semanticscholar.org/CorpusID:237298665.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Gong</surname>
          </string-name>
          , G. Chen, S. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Cross-modal self-attention with multi-task pretraining for medical visual question answering</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2105.00136. arXiv:2105.00136.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Purushotham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <article-title>Medfusenet: An attention-based multimodal deep learning model for visual question answering in the medical domain</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>11</volume>
          (
          <year>2021</year>
          )
          <fpage>19826</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Neural computation 9</source>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Bidirectional lstm-crf models for sequence tagging</article-title>
          ,
          <source>arXiv preprint arXiv:1508.01991</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Empirical evaluation of gated recurrent neural networks on sequence modeling</article-title>
          ,
          <year>2014</year>
          . URL: https://arxiv.org/abs/1412.3555. arXiv:1412.3555.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.-Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gou</surname>
          </string-name>
          ,
          <article-title>Learning from the guidance: Knowledge embedded meta-learning for medical visual question answering</article-title>
          ,
          <source>in: Neural Information Processing: 27th International Conference, ICONIP 2020, Bangkok, Thailand, November 18–22, 2020, Proceedings, Part IV 27</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>194</fpage>
          -
          <lpage>202</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Y. Khare, V. Bagal, M. Mathew, A. Devi, U. D. Priyakumar, C. Jawahar, MMBERT: Multimodal BERT pretraining for improved medical VQA, in: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), IEEE, 2021, pp. 1033–1036.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Q. Xiao, X. Zhou, Y. Xiao, K. Zhao, Yunnan University at VQA-Med 2021: Pretrained BioBERT for medical domain visual question answering, in: CLEF (Working Notes), 2021, pp. 1405–1411.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] B. Liu, L.-M. Zhan, X.-M. Wu, Contrastive pre-training and representation distillation for medical visual question answering based on radiology images, in: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part II 24, Springer, 2021, pp. 210–220.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] L.-M. Zhan, B. Liu, L. Fan, J. Chen, X.-M. Wu, Medical visual question answering via conditional reasoning, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 2345–2354.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] M. Lin, Q. Chen, S. Yan, Network in network, arXiv preprint arXiv:1312.4400 (2013).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] W. Yim, A. Ben Abacha, N. Codella, R. A. Novoa, J. Malvehy, Overview of the MEDIQA-MAGIC task at ImageCLEF 2025: Multimodal and generative telemedicine in dermatology, in: CLEF 2025 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Madrid, Spain, 2025.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>