<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DermoSegDif and DermKEM for Comprehensive Dermatology AI</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nguyen Pham Hoang Le</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hoang Pham Duc Huy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hien Thai Dinh Nhat</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hoang Thach Minh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thien B. Nguyen-Tat</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Information Technology</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>The integration of AI into dermatological diagnostics is rapidly transforming clinical practice, with crucial applications in precise lesion segmentation and intelligent Visual Question Answering (VQA). Our team, H3N1, participated in ImageCLEF MAGIC 2025, tackling both of these core challenges, finishing in the top 4 in the Dermatology Segmentation Task and in first place in the Dermatology VQA Task. These results validate two advanced systems. First, we use DermoSegDiff, a system proposed by A. Bozorgpour et al. that advances skin lesion segmentation by leveraging a Denoising Diffusion Probabilistic Model (DDPM) with boundary detection through a weighted loss and 'boundary attention' for precise contour delineation (Jaccard: 0.514, Dice: 0.679). Its modified U-Net with a two-path feature extraction strategy captures complementary features, giving the model a more comprehensive view. Second, our DermKEM (Dermatology Knowledge-Enhanced Ensemble Model) system for dermatology VQA excels with a knowledge-augmented multi-model ensemble: it employs a Genetic Algorithm for image enhancement, enriches captions via BLIP and external knowledge (Gemini 2.5 Flash), and feeds this into an ensemble of baseline models such as MUMC and Gemini 2.5 Flash for highly accurate, contextually rich answers (Accuracy: 0.758). These synergistic systems demonstrate state-of-the-art capabilities for intelligent clinical support in dermatology.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The integumentary system, primarily comprising the skin, stands as the human body’s largest organ
and serves as a crucial interface with the external environment. Its multifaceted roles include providing
a protective barrier against pathogens and physical insults, regulating body temperature, facilitating
sensory perception, and contributing to immune responses. Consequently, dermatological diseases,
which encompass a wide array of conditions ranging from common inflammatory disorders such as
eczema and psoriasis to potentially life-threatening malignancies like melanoma, represent a significant
global health burden. The accurate and timely diagnosis of these conditions is paramount for effective
management and improved patient outcomes. Traditionally, dermatological diagnosis heavily relies
on visual inspection by trained clinicians, often augmented by non-invasive imaging techniques like
dermatoscopy. While this approach remains fundamental, its efficacy can be influenced by factors
such as inter-observer variability, the subtlety of early-stage lesion characteristics, and the clinician’s
experience level. Furthermore, the increasing demand for accessible dermatological care, particularly in
remote or underserved regions, has spurred the growth of teledermatology, where automated clinical
feedback is essential. In this context, objective and reliable automated decision support systems are
becoming increasingly vital. Artificial intelligence (AI) has demonstrated considerable potential in
augmenting diagnostic capabilities within various medical imaging domains, including dermatology.
Specifically, two key areas of AI research, dermatology image segmentation and Closed Visual Question
Answering (CVQA) for dermatology, are poised to revolutionize dermatological image analysis. Medical
image segmentation aims to precisely delineate regions of interest from surrounding healthy tissue
by generating masks that identify the affected areas. This provides quantitative data crucial for lesion
characterization and monitoring. Concurrently, VQA systems, which integrate computer vision with
natural language processing, enable clinicians to pose specific questions about an image (e.g., "How
much of the body is affected?") and receive contextually relevant, evidence-based answers. Such systems
can enhance diagnostic accuracy, streamline workflows, and facilitate more effective communication
between healthcare providers and AI tools. However, the development of robust AI models relies on
the availability of large-scale, high-quality, and diverse datasets, as well as rigorous benchmarking.
To address these challenges and foster advancements in AI
for dermatology, the ImageCLEFmed MEDIQA-MAGIC 2025 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] challenge has been established. This
initiative includes tasks directly relevant to dermatology: Task 1: Segmentation of Skin Conditions
and Task 2: Generative Closed-domain Question Answering on Dermatology Images. The competition
aims to stimulate the development of systems that can automatically generate clinical feedback in a
teledermatology context, leveraging both visual and textual data.
      </p>
      <p>
Our team, H3N1, participated in both dermatological tasks. For Task 1 (Segmentation), we investigated
advanced deep learning architectures, exploring diffusion-based models like DermoSegDiff
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for precise lesion boundary delineation, complemented by robust preprocessing techniques including
genetic algorithm-based image enhancement. For Task 2 (Closed VQA), we introduce DermKEM, a
system designed to leverage sophisticated multimodal approaches, drawing inspiration from established
frameworks such as MUMC [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for medical VQA and the capabilities of large multimodal models like
Gemini [
        <xref ref-type="bibr" rid="ref4">4</xref>
]. These are augmented by strategic data preprocessing, including BLIP-based
image caption generation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and external knowledge linking using Gemini [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This paper details our
methodologies, experimental setup, and results for the MEDIQA-MAGIC 2025 dermatological tasks,
contributing to the growing body of research on AI-driven solutions for enhanced dermatological care.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
Recent work on AI in the medical field has increased significantly[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ][
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]; especially in the field of
visual question answering (VQA), existing efforts have largely concentrated on radiological images.
VQA-Med 2019[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] specifically focused on radiology images and four main categories of questions. The
top-performing systems in the contest mainly employed deep learning techniques, using CNNs such as
VGGNet[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and ResNet[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to extract visual features, and models like BERT[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or RNNs to encode
the questions. Attention mechanisms and multimodal pooling methods such as MFB and MFH were
then used to fuse image and text features for answer prediction. In the MEDIQA-M3G 2024 Shared Task
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], researchers explored solutions for dermatological consumer-health visual question answering, in
which user-generated queries and images serve as input, and a free-text answer is produced as output.
The top performance for the English results was achieved by CLIP[
        <xref ref-type="bibr" rid="ref14">14</xref>
] (and its fine-tuned variant),
Claude[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] with prompt-based engineering, and PMC-VQA[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] (PMC-CLIP and PMC-LLaMA).
      </p>
      <p>
        In the medical field, segmentation is a crucial step for identifying and delineating abnormal skin
regions such as lesions, malignancies, or infected areas. A widely adopted architecture for this task is
U-Net[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which features an encoder–decoder structure with skip connections that help retain detailed
spatial information. U-Net and its variants, such as UNet++[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and UNet 3+[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], have demonstrated
high effectiveness in segmentation.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. The Proposed Approach</title>
<p>In this work, we addressed two primary tasks. For Task 1, we employed DermoSegDiff, a
boundary-aware system for skin lesion segmentation. For Task 2, we developed DermKEM, an advanced visual
question answering (VQA) system tailored for dermatology. The architectures of these two systems are
illustrated in Figure 1 and Figure 2, respectively, and will be elaborated upon in subsequent sections.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Preprocessing</title>
        <sec id="sec-3-1-1">
          <title>3.1.1. Image Enhancement</title>
<p>We noticed that the proposed DermoSegDiff model includes no image enhancement step, so
we added our own Genetic Algorithm-based image enhancement. We evolve a population of 20
individuals over 10 generations. For each image, we randomly perturb the contrast α and the brightness β
and compute fitness using SSIM. Next, we select the top half by fitness and create a new population by
averaging α and β from two randomly selected top individuals, adding a small random variation in the
process. After 10 generations, we extract the best enhanced image based on SSIM, with its contrast α and
brightness β. Figure 3.1.1 shows a significant improvement around skin lesions, making the boundary
clearer and easier to detect.</p>
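<p>The procedure above can be sketched as follows. This is a minimal illustration, not the competition code: the fitness here is a simplified single-window SSIM, and the (α, β) initialization ranges and mutation scales are assumptions.</p>

```python
import numpy as np

def ssim_global(x, y):
    # Simplified single-window SSIM used as the GA fitness (a sketch;
    # a production pipeline would use a windowed SSIM implementation).
    c1, c2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (x.var() + y.var() + c2))

def apply_ab(img, ab):
    # Contrast alpha and brightness beta: pixel' = alpha * pixel + beta.
    return np.clip(ab[0] * img + ab[1], 0, 255)

def enhance(img, pop_size=20, generations=10, seed=0):
    rng = np.random.default_rng(seed)
    # Population of (alpha, beta) pairs; ranges are assumptions.
    pop = [np.array([rng.uniform(0.8, 1.5), rng.uniform(-30.0, 30.0)])
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda ab: ssim_global(img, apply_ab(img, ab)), reverse=True)
        top = pop[:pop_size // 2]                     # keep the fittest half
        children = []
        for _ in range(pop_size - len(top)):
            i, j = rng.choice(len(top), 2, replace=False)
            child = (top[i] + top[j]) / 2             # crossover by averaging
            child += rng.normal(0.0, [0.05, 2.0])     # small random mutation
            children.append(child)
        pop = top + children
    best = max(pop, key=lambda ab: ssim_global(img, apply_ab(img, ab)))
    return apply_ab(img, best), best[0], best[1]
```

<p>Maximizing SSIM against the original keeps the enhancement conservative; stronger contrast changes only survive if they preserve the image structure.</p>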
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Image Resizing</title>
          <p>
            Due to the inconsistency of the images provided in the dataset, we apply a resize image method proposed
in DermoSegDif[
            <xref ref-type="bibr" rid="ref2">2</xref>
] to ensure uniform image size, thereby reducing training time under
resource constraints. First, images are converted to tensors. Next, we resize the image tensors,
using interpolation to retain image information. Finally, we normalize the tensors to [0, 1], replacing NaN values
with 0.5.
          </p>
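<p>A minimal numpy sketch of this pipeline follows (DermoSegDiff itself operates on torch tensors; the bilinear interpolation and target size here are illustrative assumptions):</p>

```python
import numpy as np

def resize_bilinear(img, out_h, out_w):
    # img: (H, W) float array; returns a bilinearly interpolated (out_h, out_w) array.
    h, w = img.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

def preprocess(img, size=128):
    # Resize with interpolation, scale to [0, 1], and fill NaNs with 0.5.
    t = resize_bilinear(img.astype(float), size, size)
    t = t / 255.0
    return np.nan_to_num(t, nan=0.5)
```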
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Additional images caption</title>
          <p>
            To generate descriptive image captions during preprocessing, we employed the BLIP (Bootstrapping
Language-Image Pre-training) model [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]. To adapt this model for the dermatology domain, we
fine-tuned the pre-trained BLIP on the SkinCap dataset [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ], which contains 4,000 dermatology images with
corresponding captions. The model was fine-tuned for 8 epochs, achieving a BLEU score of 0.16 on the
SkinCap validation set. This confirmed the model’s improved proficiency for generating domain-specific
captions for our task.
          </p>
        </sec>
        <sec id="sec-3-1-4">
          <title>3.1.4. Linking External Knowledge</title>
          <p>
            In the Medical VQA task, image-generated captions often pose challenges for models that are not
extensively trained on specialized medical contexts. These captions are typically short and may fail to
accurately describe lesions or pathological signs. We leverage Gemini 2.5 Flash [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] via the Vertex AI
API to enrich the captions by linking external knowledge from reputable open-access medical sources
such as DermNet NZ (dermnetnz.org) and WikiDoc (wikidoc.org). This enriched captioning process
helps the model better understand the medical image before answering the question.
          </p>
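<p>The enrichment step can be sketched as a prompt-construction helper. The template below is hypothetical, not the team's verbatim prompt; only the knowledge sources named above are taken from the text.</p>

```python
def build_enrichment_prompt(caption, question):
    # Hypothetical prompt template for the caption-enrichment call to
    # Gemini 2.5 Flash; the exact wording used in the experiments is unknown.
    return (
        "You are a dermatology assistant. Enrich the following image caption "
        "with relevant clinical knowledge from open-access sources such as "
        "DermNet NZ (dermnetnz.org) and WikiDoc (wikidoc.org).\n"
        f"Caption: {caption}\n"
        f"Question to be answered later: {question}\n"
        "Return the enriched caption only."
    )
```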
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Methodology</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Task 1: Segmentation</title>
          <p>
For the skin lesion segmentation task, we employed DermoSegDiff [
<xref ref-type="bibr" rid="ref2">2</xref>
] by A. Bozorgpour et al., a
diffusion-based segmentation model designed specifically for skin lesion segmentation. The model
leverages the generative capabilities of Denoising Diffusion Probabilistic Models (DDPMs) while
incorporating boundary-aware mechanisms to enhance segmentation precision, especially around
lesion boundaries.
          </p>
          <p>
Diffusion Process: DermoSegDiff is based on a standard DDPM framework that includes a forward
process, in which Gaussian noise is gradually added to the ground-truth segmentation mask,
and a reverse denoising process, which reconstructs the mask step by step under the guidance of the input image.
Rather than predicting the mask directly, the network learns to estimate the noise added at
each timestep, enabling more accurate reconstruction. Loss Function: To address the challenge of
fuzzy and ambiguous lesion boundaries, DermoSegDiff introduces a novel boundary-aware loss function:
ℒ = ‖(1 + Θ_t) ⊙ (ε − ε_θ(x_t, g, t))‖²  (1)
where Θ_t ∈ (0, 1) is a dynamic weight map derived from a distance transform of the mask
boundary, ε is the added noise, and ε_θ is the network's estimate. This term increases the weight of pixels near the lesion boundary and decreases it
farther from the boundary. The dynamic nature of Θ_t, based on the current timestep t, ensures the
progressive refinement of the boundary regions as the denoising proceeds. A gamma correction is
applied to the distance map to control the sharpness of attention around the boundary.
Denoising Network Architecture: The denoising network is a modified U-Net [
<xref ref-type="bibr" rid="ref22">22</xref>
] architecture with
a two-path feature extraction mechanism:
• The image-conditioned path extracts semantic features from the input image g, providing
contextual guidance throughout the reverse process.
• The latent-conditioned path processes the noisy segmentation mask x_t, learning to progressively
reduce noise.
          </p>
          <p>
            Each path contains ResNet blocks with separate time embeddings to capture distinct temporal
characteristics. The features of both paths are fused at multiple stages of the U-Net[
            <xref ref-type="bibr" rid="ref22">22</xref>
            ], allowing an
effective integration of semantic and noise-related representations. A dual-attention bottleneck module
further enhances the model’s ability to capture both spatial dependencies and long-range interactions.
          </p>
<p>The decoder reconstructs the estimated noise ε̂ by utilizing enriched skip connections from
the encoder, which carry both semantic and boundary-focused information. An additional skip
connection from the initial noisy input x_t to the output layer ensures the retention of noise characteristics
critical for accurate reconstruction.</p>
<p>Inference Strategy with Sampling-based Ensemble: During inference, DermoSegDiff adopts
a sampling-based ensemble strategy to enhance segmentation robustness. Specifically, the model
generates nine segmentation predictions for each test image by running the diffusion sampling process
multiple times. These predictions are then averaged pixel-wise, followed by thresholding (at 0) to produce
the final segmentation mask. This ensemble method mitigates the stochastic nature of
diffusion models and improves result consistency.</p>
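<p>The averaging-and-thresholding step can be sketched as (a minimal illustration; the raw prediction format of the diffusion model is an assumption):</p>

```python
import numpy as np

def ensemble_mask(predictions, threshold=0.0):
    # predictions: list of (H, W) arrays of raw model outputs from
    # independent diffusion sampling runs (nine in the setup described above).
    mean_pred = np.mean(np.stack(predictions), axis=0)  # pixel-wise average
    return (mean_pred > threshold).astype(np.uint8)     # binarize at threshold
```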
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Task 2: Closed Visual Question Answering</title>
          <p>
We experimented with two kinds of models, a traditional model and a Vision-Language Model (VLM), to assess the
performance of baseline models on dermatology visual question answering. For the traditional model, we
utilize MUMC[
            <xref ref-type="bibr" rid="ref3">3</xref>
], a state-of-the-art (SOTA) model for medical visual question answering. For the VLM,
we use Gemini 2.5 Flash[
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] through Vertex AI API.
          </p>
          <p>
            MUMC: MUMC[
            <xref ref-type="bibr" rid="ref3">3</xref>
] uses a novel self-supervised pretraining method to efficiently learn to understand
and associate information from medical images and texts through information masking and multi-level
contrastive learning.
          </p>
          <p>
            Gemini 2.5 Flash: Gemini 2.5 Flash[
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] represents the latest advancement within the Gemini family
of models. It is specifically engineered for speed and cost-effectiveness, thereby offering
a substantially faster and more lightweight alternative for tasks demanding low latency and high
throughput. While maintaining robust performance across diverse modalities including text, image,
and audio (and potentially video, contingent upon its specific capabilities), Gemini 2.5 Flash excels in
applications such as real-time interactive chat, text summarization, and on-device Artificial Intelligence.
This characteristic renders powerful multimodal understanding more accessible and practical for an
expanded range of use cases.
          </p>
<p>Answer Shuffle: For Vision-Language Models (VLMs), we experimented with shuffling the order of
answer options to evaluate model consistency. Typically, answer options are mapped as follows: option
A to 1, option B to 2, option C to 3, and option D to 4. After shuffling, an example mapping
could be A to 3, B to 1, C to 4, and D to 2. The model is queried multiple times with these shuffled
mappings, and the final selected option is the answer choice chosen with the highest
frequency. When multiple answer choices tie for the highest frequency,
one is selected at random, although this was observed to occur infrequently.
Few-shot Learning: For VLMs, we implemented few-shot learning. For each input sample, we provided
a set of ground-truth examples randomly selected from the training dataset as in-context learning
prompts. The number of such examples was equal to the total number of questions in the evaluation
set, as detailed further in Section 5.</p>
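<p>The shuffle-and-vote procedure can be sketched as follows (a minimal illustration; the option labels and tie-breaking seed are assumptions, and the same hard vote also serves the ensemble described next):</p>

```python
import random
from collections import Counter

def shuffled_mappings(options, n_shuffles, rng=random.Random(0)):
    # Original option order plus n_shuffles random permutations;
    # each order is presented to the model as a separate query.
    orders = [list(options)]
    for _ in range(n_shuffles):
        perm = list(options)
        rng.shuffle(perm)
        orders.append(perm)
    return orders

def majority_answer(model_outputs, rng=random.Random(0)):
    # Hard vote: the most frequent answer wins; ties are broken at random.
    counts = Counter(model_outputs)
    best = max(counts.values())
    winners = [a for a, c in counts.items() if c == best]
    return winners[0] if len(winners) == 1 else rng.choice(winners)
```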
          <p>Ensemble: We employed a hard-voting ensemble strategy, combining the outputs from multiple model
inferences. The final output was determined as the answer most frequently selected by the constituent
models. In cases of a tie for the most frequent answer, one was selected randomly.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Task and Dataset Descriptions</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset Descriptions</title>
        <p>
We used the official ImageCLEFmed MEDIQA-MAGIC 2025 dataset [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], which builds upon
documentation from DermaVQA [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. The dataset supports two tasks and is divided into training (2474 images),
validation (157 images), and test (314 images) sets.
        </p>
        <p>This dataset facilitates a segmentation task and is structured as follows: The training set,
constituting 85%, consists of 2474 images, 7448 masks, and 842 queries. The validation set
contains 157 images, 472 masks, and 56 queries, while the test set is composed of 314 images, 944
masks, and 100 queries. Mask files are stored as binary TIFF files, adhering to the naming
convention IMG_{ENCOUNTERID}_{IMAGEID}_mask_{ANNOTATOR#}.tiff. Corresponding
image files are available in PNG or JPG format, named as IMG_{ENCOUNTERID}_{IMAGEID}.png
or IMG_{ENCOUNTERID}_{IMAGEID}.jpg. For the Closed QA task, the dataset includes
closed questions, associated images, a dictionary of all possible closed questions with
their associated option values, and a predefined list of 27 questions provided in the
closedquestions_definitions_imageclef2025.json file, with both English and Chinese
translations. The distribution of question types related to illnesses is shown in Figure 5.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Task Definitions</title>
        <p>The second edition of the MEDIQA-MAGIC task focuses on multimodal dermatology response
generation. Building upon the previous year’s challenge, this task introduces more complex reasoning
by combining clinical narratives with associated dermatology images. The task is divided into two
sub-tasks:</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Segmentation</title>
        <p>• Definition: Given a clinical history and an associated dermatological image, participants are
required to generate segmentation masks that identify regions of interest related to the described
dermatological condition.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Closed VQA</title>
<p>• Definition: Participants are provided with a dermatology-related query (clinical
narrative), one or more related images, and a multiple-choice question. The goal is to select the
correct answer from the provided options.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiment Results</title>
      <sec id="sec-5-1">
        <title>5.1. Implementation Details</title>
        <p>Our experiments for the ImageCLEF MAGIC 2025 challenge were conducted on the Kaggle platform,
utilizing two NVIDIA T4 GPUs and one NVIDIA P100 GPU.</p>
        <sec id="sec-5-1-1">
          <title>5.1.1. Task 1: Segmentation</title>
<p>For our segmentation task, we used the DermoSegDiff framework as the core model. To improve the
quality of our training data, we applied a Genetic Algorithm (GA) before training the segmentation
model. Since DermoSegDiff has no built-in image enhancement, this step helped create clearer
and more informative images. The enhanced images were used to build new versions of the training,
validation, and test datasets. We trained the model with the following settings:
• Batch size: 8
• Input image size: 128 × 128 pixels
• Diffusion settings:
– Timesteps: 250
– Beta schedule: linear
– β start: 0.0004
– β end: 0.08
For optimization, we used the Adam optimizer with:
• Learning rate: 0.0001
• Betas: (0.7, 0.99)
• Weight decay: 0.0
We also included a learning-rate scheduler (ReduceLROnPlateau) to help the model train more efficiently:
if the validation score did not improve for 5 epochs, the learning rate was halved. Each run lasted
up to 40,000 iterations, or ended earlier if the validation results stopped improving. No other hyperparameters
were tested due to time constraints, as we assumed the original settings were well suited to the
task. For evaluation, we used an ensemble approach. Each test image was passed through the model 5
times, and we averaged the outputs pixel by pixel. The final result was turned into a binary mask using
a threshold. This method helped reduce noise and made the predictions more stable and accurate.</p>
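<p>The scheduler behaviour described above can be sketched as a minimal pure-Python re-implementation (the experiments used torch's ReduceLROnPlateau; this sketch only mirrors the halve-after-5-stale-epochs rule):</p>

```python
class HalveOnPlateau:
    # Halve the learning rate once the validation score has failed to
    # improve for more than `patience` consecutive epochs.
    def __init__(self, lr=1e-4, patience=5, factor=0.5):
        self.lr, self.patience, self.factor = lr, patience, factor
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_score):
        if val_score > self.best:          # improvement: reset the counter
            self.best = val_score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor     # halve the learning rate
                self.bad_epochs = 0
        return self.lr
```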
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Task 2: Closed Visual Question Answering</title>
          <p>
            MUMC: We utilized the MUMC model. The model’s pre-training stage involved a version pre-trained
on three datasets: ROCO [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ], MedICaT [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ], and the ImageCLEF2022 Image Caption Dataset [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ].
Subsequently, its fine-tuning stage employed a model pre-trained on three public medical VQA datasets:
VQA-RAD [28], PathVQA [29], and SLAKE [30]. This model was then fine-tuned on our proprietary
datasets using its default hyperparameter settings, with each training stage comprising 50 epochs.
Gemini 2.5 Flash: Gemini 2.5 Flash was accessed via the Vertex AI API, incorporating both few-shot
learning and answer shuffling techniques. For few-shot learning, the number of in-context examples
provided with each query was set to be equal to the total number of questions in the evaluation set. In
this experiment, this translated to 27 few-shot examples accompanying each inference request. This
approach aims to enhance prediction accuracy by providing relevant ground-truth examples as context.
For answer shuffling, the order of answer options was permuted twice for each sample. Consequently,
the model generated three outputs per sample: one with the original answer order and two with distinct
shuffled orders.
          </p>
          <p>Ensemble: The ensemble method aggregated outputs from multiple model inferences. Final predictions
were determined using hard voting, where the answer option with the highest frequency among the
collected outputs was selected. We experimented with three distinct ensemble configurations:
• MUMC combined with two Gemini 2.5 Flash inference runs (one using the original answer order,
and one using a single shuffled answer order).
• MUMC combined with three Gemini 2.5 Flash inference runs (one original, and two using distinct
shuffled answer orders).
• An ensemble of three Gemini 2.5 Flash inference runs (one original, and two using distinct shuffled
answer orders).</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Experimental Results</title>
        <sec id="sec-5-2-1">
          <title>5.2.1. Evaluation Metrics</title>
<p>We used the official evaluation metrics defined by the ImageCLEFmed MEDIQA-MAGIC 2025 organizers
for each task.</p>
<p>Task 1: Segmentation. Performance was measured by the Jaccard index (IoU) and the Dice coefficient:
IoU = |P ∩ G| / |P ∪ G|;  Dice = 2·|P ∩ G| / (|P| + |G|)
where P is the predicted mask and G is the ground-truth mask. The ground truth was constructed
using a 'Majority Vote' over four annotators, where a pixel is considered positive if marked by at least
two annotators. The final reported scores are macro-averaged IoU and Dice across the test set.</p>
          <p>Task 2: Closed Visual Question Answering. The task was evaluated using overlap-based accuracy,
suitable for multi-label answers. The accuracy for each question i is calculated as
Accuracy_i = |G_i ∩ P_i| / max(|G_i|, |P_i|)
where G_i and P_i are the ground-truth and predicted label sets, respectively. The final score is the mean
of these per-question accuracies across the entire test set.</p>
        </sec>
        <sec id="sec-5-2-2-results">
          <title>5.2.2. Results</title>
          <p>Task 1: Segmentation. This section presents the performance of our submission for the
MEDIQA-MAGIC 2025 segmentation task. Our approach utilized the DermoSegDiff model, a state-of-the-art
diffusion-based architecture for medical image segmentation. The model was evaluated on the unseen
private test set using the official competition metrics. The official results are summarized in Table 1.</p>
          <p>Task 2: Closed Visual Question Answering. For the private test phase, we submitted four pipelines,
detailed in Table 3. Gemini 2.5 Flash proved effective: the ensemble of three
Gemini inference runs achieved the highest private test score.</p>
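<p>The official metrics and the majority-vote ground-truth construction can be sketched directly from their definitions:</p>

```python
import numpy as np

def majority_vote(annotator_masks):
    # Ground truth: a pixel is positive if marked by at least 2 of the 4 annotators.
    return (np.sum(np.stack(annotator_masks), axis=0) >= 2).astype(np.uint8)

def iou_dice(pred, gt):
    # Jaccard index and Dice coefficient for binary masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0
    denom = pred.sum() + gt.sum()
    dice = 2 * inter / denom if denom else 1.0
    return iou, dice

def overlap_accuracy(gt_labels, pred_labels):
    # Per-question overlap-based accuracy for multi-label answers.
    g, p = set(gt_labels), set(pred_labels)
    return len(g & p) / max(len(g), len(p))
```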
<p>Table 3 reports the overall accuracy of the four evaluated pipelines: Gemini 2.5 Flash (+preprocessing); the ensemble of MUMC and two Gemini 2.5 Flash runs; the ensemble of MUMC and three Gemini 2.5 Flash runs; and the ensemble of three Gemini 2.5 Flash runs (all ensembles with preprocessing and answer shuffling).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Works</title>
<p>Limitations: Despite the effectiveness of the proposed model, it exhibits notable limitations in handling
very small lesions or subtle variations in texture and color, which may be imperceptible to the human
eye or not be sufficiently captured by the model without specialized training. The model can also
struggle to distinguish between lesion types with similar surface characteristics and may fail to fully
capture multi-attribute lesions (e.g., those presenting multiple colors or patterns). Additionally, the
encoder stage of the proposed diffusion network may not extract sufficiently comprehensive features
from the input images and corresponding segmentation masks, potentially limiting segmentation
accuracy in complex cases.</p>
<p>Future Works: Future improvements can focus on combining MUMC and Gemini 2.5 Flash
in a hybrid or ensemble model to improve both image analysis and language understanding, making
the system more accurate and reliable. Fine-tuning the models on a larger and more diverse medical
dataset, especially with rare skin conditions, could help improve generalization. Adjusting Gemini 2.5
Flash’s prompt format to better handle medical terms may also boost performance. To reduce bias and
increase realism, the dataset could be expanded with more context-based questions and images from
people of various races and age groups. Evaluation can be strengthened by using metrics such as BLEU,
ROUGE, and METEOR for free-form responses, as well as expert feedback from dermatologists. For
the segmentation part, testing stronger encoder models like TransUNet [31] and applying more data
augmentation can lead to better learning from the available data.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Appendices</title>
    </sec>
    <sec id="sec-8">
      <title>A. MUMC Architecture and Training Details</title>
      <p>
        The MUMC (Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive
Losses) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] framework operates in two main stages: pre-training and fine-tuning.
      </p>
      <p>Stage 1: Pre-training This stage aims to learn robust and highly generalizable representations from
large-scale medical image-caption datasets.</p>
      <p>1. Input Data Preparation:
• Image: Each image is divided into 16×16 patches. During training, 25% of the patches are
randomly masked, and only the unmasked patches are input to the image encoder. This
masked image modeling encourages the model to learn from partial visual information.
• Text (Caption): Associated text descriptions are tokenized using a WordPiece tokenizer.
2. Architecture: A dual-encoder setup based on Momentum Contrast (MoCo) is employed,
consisting of online and momentum encoders that stabilize contrastive training.
3. Optimization Objectives: The model is jointly optimized with four self-supervised losses:
• Contrastive Loss (UCL and MCL): The core objective, which structures the latent space by
pulling similar pairs together and pushing dissimilar pairs apart. Unimodal Contrastive Loss
(UCL) operates within a single modality (image-to-image, text-to-text), while Multimodal
Contrastive Loss (MCL) learns the alignment between modalities (image-to-text).
• Image-Text Matching (ITM) Loss: A binary classification loss that determines whether a
given image-caption pair is matched or randomly paired. This reinforces semantic alignment
between modalities beyond just contrastive structure.
• Masked Language Modeling (MLM) Loss: A language modeling objective where 15%
of input tokens in the caption are randomly masked and the model must predict them
using the remaining text and associated image features, thereby strengthening contextual
understanding.
Stage 2: Fine-tuning The pretrained weights are transferred to a VQA model with a Transformer-based
answering decoder (6 layers) that generates free-text answers from fused image-question
embeddings. The model is fine-tuned using a standard cross-entropy loss over the target answer sequences.</p>
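      <p>The masking step in Stage 1 can be illustrated with a short sketch: a random 25% of the image patches are hidden, and only the remaining patches would be fed to the image encoder. This is our own minimal Python sketch under stated assumptions, not the MUMC implementation; the function name and the 196-patch example (a 224x224 image in 16x16 patches) are illustrative.</p>

```python
import random

def mask_patches(num_patches, mask_ratio=0.25, seed=None):
    """Randomly select patch indices to mask, as in masked image modeling.

    Returns (masked, visible) index lists; only the visible patches would
    be passed to the image encoder. Illustrative sketch, not the authors'
    code.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    n_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:n_masked])
    visible = sorted(indices[n_masked:])
    return masked, visible

# A 224x224 image cut into 16x16 patches yields 14 * 14 = 196 patches,
# of which 49 (25%) are masked and 147 remain visible.
masked, visible = mask_patches(196, mask_ratio=0.25, seed=0)
```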
    </sec>
    <sec id="sec-9">
      <title>B. Prompt for Gemini</title>
      <p>This prompt utilizes a few-shot learning technique, wherein the model is provided with 27 complete
examples from the training set before it is asked to process a new case. The primary objective of
this structured prompt is to strictly constrain the model’s output to a single integer corresponding
to the chosen answer’s index, thereby simplifying the results parsing process and increasing output
consistency.</p>
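      <p>Because the prompt constrains the reply to a single integer, the results parsing mentioned above reduces to extracting and range-checking that integer. The helper below is a hypothetical sketch of such a parser (the function name and behavior on malformed replies are our own assumptions, not part of the described system); it tolerates stray whitespace or prefix text that a model may still emit.</p>

```python
import re

def parse_answer_index(model_output, num_options):
    """Return the predicted option index, or None if the reply cannot be
    parsed or the index is out of range. Hypothetical helper, not part of
    the original pipeline."""
    match = re.search(r"\d+", model_output)
    if match is None:
        return None  # unparseable reply; caller may retry or skip the case
    index = int(match.group())
    return index if 0 <= index < num_options else None

parse_answer_index("2", 4)             # -> 2
parse_answer_index(" Answer: 3\n", 4)  # -> 3
parse_answer_index("maybe five", 4)    # -> None
```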
      <sec id="sec-9-1">
        <title>B.1. Prompt Structure</title>
        <p>The prompt is constructed following a "guidance - examples - task" architecture. The full prompt
template is provided below.</p>
        <p>You are an expert dermatologist. Your task is to
answer a multiple-choice question about a clinical
case based on the provided clinical context and
image(s).</p>
        <p>You will be shown several examples first, followed by
a new case to solve.</p>
        <p>Your response MUST be a single integer representing
the index of the correct answer.</p>
        <p>DO NOT include any other text, explanation, or
formatting. JUST THE NUMBER.
--- EXAMPLE 1 START ---
[IMAGE(S) ARE PROVIDED HERE]
Clinical Context: {Example 1 Context}
Question: {Example 1 Question}
Options:
0: {Option 0}
1: {Option 1}
...</p>
        <p>Correct Answer Index:
{Example 1 Correct Index}
--- EXAMPLE 1 END ---
--- EXAMPLE 2 START ---
[IMAGE(S) ARE PROVIDED HERE]
Clinical Context: {Example 2 Context}
Question: {Example 2 Question}
Options:
0: {Option 0}
1: {Option 1}
...</p>
        <p>Correct Answer Index:
{Example 2 Correct Index}
--- EXAMPLE 2 END ---
... (25 more examples follow the same structure) ...
--- EXAMPLE 27 START ---
[IMAGE(S) ARE PROVIDED HERE]
Clinical Context: {Example 27 Context}
Question: {Example 27 Question}
Options:
0: {Option 0}
1: {Option 1}
...</p>
        <p>Correct Answer Index:
{Example 27 Correct Index}
--- EXAMPLE 27 END ---
--- YOUR TASK START ---
Now, analyze the following new case and provide your answer as
a single integer.
[IMAGE(S) ARE PROVIDED HERE]
Clinical Context: {Query Case Context}
Question: {Query Case Question}
Options:
0: {Query Option 0}
1: {Query Option 1}
2: {Query Option 2}
3: {Query Option 3}
Answer Index:</p>
      </sec>
      <sec id="sec-9-2">
        <title>B.2. Explanation of Prompt Components</title>
        <p>• Role Definition: The initial line, "You are an expert dermatologist," establishes a specific persona
and domain expertise for the model. This helps the model to approach the problem from the
perspective of a medical specialist, potentially activating more relevant reasoning paths.
• Strict Output Instruction: The lines "Your response MUST be a single integer..." are capitalized
and emphasized to capture the model’s attention and ensure it understands the precise output
format required. This is the most critical change compared to previous prompt versions, as it
eliminates conversational or explanatory text that would complicate automated evaluation.
• Few-shot Examples: A total of 27 examples (from ‘EXAMPLE 1‘ to ‘EXAMPLE 27‘) are included
to "teach" the model the desired input-output format. Each example provides a complete instance,
including the full context, the image (passed as an image object, represented by ‘[IMAGE(S) ARE
PROVIDED HERE]‘), the question, the multiple-choice options, and the correct answer index.</p>
        <p>This in-context learning is crucial for guiding the model’s behavior without updating its weights.
• Query Task: The final section, beginning with ‘--- YOUR TASK START ---‘, is where the actual
data for the case to be predicted is inserted. Placeholders such as ‘Query Case Context‘ are
dynamically replaced with the real data. The final line, ‘Answer Index:‘, acts as a direct cue for
the model to complete the sequence and provide its numerical answer.</p>
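        <p>The "guidance - examples - task" assembly described above can be sketched as follows. This is a minimal Python illustration under our own naming conventions (function names, tuple layout), not the authors' code; each example tuple carries the context, question, option list, and, for few-shot examples, the correct index.</p>

```python
def format_example(context, question, options, answer_index=None):
    """Render one case in the prompt's example layout (illustrative sketch)."""
    lines = ["[IMAGE(S) ARE PROVIDED HERE]",
             f"Clinical Context: {context}",
             f"Question: {question}",
             "Options:"]
    lines += [f"{i}: {opt}" for i, opt in enumerate(options)]
    if answer_index is not None:
        lines.append(f"Correct Answer Index:\n{answer_index}")
    else:
        lines.append("Answer Index:")  # completion cue for the model
    return "\n".join(lines)

def build_prompt(instructions, examples, query):
    """Concatenate the guidance text, the few-shot examples, and the query
    task, mirroring the template's separator lines."""
    parts = [instructions]
    for i, ex in enumerate(examples, start=1):
        parts.append(f"--- EXAMPLE {i} START ---\n"
                     + format_example(*ex)
                     + f"\n--- EXAMPLE {i} END ---")
    parts.append("--- YOUR TASK START ---\n"
                 "Now, analyze the following new case and provide your "
                 "answer as a single integer.\n" + format_example(*query))
    return "\n\n".join(parts)
```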
      </sec>
    </sec>
    <sec id="sec-10">
      <title>C. Prompt for Knowledge Enrichment</title>
      <p>The core of the enrichment process is the design of a sophisticated prompt for a powerful generative model
(Gemini 2.5 Flash). The full prompt text is provided below to illustrate the detailed instructions given to
the model.</p>
      <p>ROLE: AI Medical Concept Enrichment Specialist
CONTEXT: You are tasked with processing text content,
typically image captions or descriptions
(query_content_en field) related to medical
observations, often within a Visual Question
Answering (VQA) context for medicine, especially
but not limited to Dermatology. Your goal is to
enhance this text by identifying specific
medical terms and appending concise, accurate
definitions. This prompt is designed for the Gemini
2.5 Flash model.</p>
      <p>OBJECTIVE: To enrich the input text by identifying
relevant medical terms (including but not limited to
Dermatology) and appending their brief, accurate
definitions immediately after the term, formatted as
Term [Definition]. Definitions for
Dermatology-specific
terms should prioritize consistency with DermNet NZ
(site:dermnetnz.org). Definitions for other medical
terms (diseases, symptoms, findings, anatomical
locations, procedures, relevant medications) should
be consistent with standard medical knowledge and
ontologies like SNOMED CT
(site:https://www.snomed.org/)
or UMLS (site:
https://www.nlm.nih.gov/research/umls/index.html)
or wikidoc
(site:
https://www.wikidoc.org/index.php/Main_Page),
use your crawl skills to get information.</p>
      <p>INPUT: A single string of text representing the value
of a query_content_en field or similar medical text
description.</p>
      <p>OUTPUT: The modified string of text, with relevant
medical terms enriched as specified. The overall
structure and non-relevant parts of the original
text must remain unchanged.</p>
      <p>CONSTRAINTS:
1. Scope: Enrich specific medical terms. This includes:
- Names of diseases or conditions (e.g., psoriasis).
- Specific symptoms or clinical findings (e.g.,
macule, erythema).
- Relevant anatomical locations (e.g., epidermis, dermis).
- Medical or surgical procedures (e.g., biopsy).</p>
      <p>- Commonly referenced medications (e.g., methotrexate).
2. Exclusion: Do NOT enrich:
- Highly general terms (e.g., ’disease’, ’patient’).
- Common non-medical words (e.g., ’tired’, ’left’, ’right’).</p>
      <p>- Terms already adequately explained by the context.
3. Source Prioritization:
- For Dermatology: Prioritize DermNet NZ
(site:dermnetnz.org).</p>
      <p>- For other medical terms: Use SNOMED CT, UMLS, Wikidoc.
4. Definition Format: Term [Concise, clear definition].
5. Definition Content: Brief, 1-2 short sentences.
6. Accuracy: Ensure definitions are medically accurate.
7. Case Sensitivity: Identify terms regardless of case,
but preserve</p>
      <p>original capitalization in the output.
8. No Modification Otherwise: Do not alter any other
part of the text.</p>
      <p>INSTRUCTIONS:
1. Receive the input text string.
2. Scan the text to identify potential medical keywords.
3. For each term:
a. Verify it meets enrichment criteria
(CONSTRAINT 1 &amp; 2).
b. Determine if it is dermatological or general medical.
c. Generate a concise definition per Source Prioritization.
d. Format the definition as specified (CONSTRAINT 4 &amp; 5).</p>
      <p>e. Append the formatted definition to the term.
4. If no relevant terms are found, return the original text.
5. Return the fully processed text string.</p>
      <p>EXAMPLES:
Input Text (Dermatology Focus): The patient presented
with severe psoriasis and was prescribed methotrexate.</p>
      <p>Output Text: The patient presented with severe psoriasis
[a common,
chronic inflammatory skin disease characterized by red,
itchy, scaly
patches] and was prescribed methotrexate [an
immunosuppressant drug...].</p>
      <p>Input Text (General Medical): Image shows pitting edema on
the lower leg.</p>
      <p>Output Text: Image shows pitting edema [swelling,
typically in the limbs, where pressing the skin
leaves a temporary indentation] on the lower leg.</p>
      <p>Input Text (Exclusion): Doctors predict that he
has some kind of infection, he feels tired.</p>
      <p>Output Text: Doctors predict that he has some kind of
infection [invasion and multiplication of
microorganisms...], he feels tired.
(Note: ’predict’, ’tired’ are not enriched).</p>
      <p>Now, process the following input text based on these
instructions:</p>
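      <p>Constraint 8 ("no modification otherwise") is mechanically checkable: stripping the appended bracketed definitions from the enriched output must recover the original input exactly. The sketch below is a hypothetical validation helper of our own (it assumes definitions never contain nested square brackets), not part of the described pipeline.</p>

```python
import re

def strip_enrichment(enriched):
    """Remove appended " [definition]" segments so the result can be
    compared against the original text. Hypothetical checker; assumes
    definitions contain no nested square brackets."""
    return re.sub(r"\s\[[^\[\]]*\]", "", enriched)

original = "The patient presented with severe psoriasis."
enriched = ("The patient presented with severe psoriasis [a common, "
            "chronic inflammatory skin disease characterized by red, "
            "itchy, scaly patches].")
assert strip_enrichment(enriched) == original
```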
      <sec id="sec-10-1">
        <title>C.1. Analysis of Prompt Architecture and Effectiveness</title>
        <p>The prompt’s design is multi-faceted, aiming to transform a powerful but general-purpose language
model into a precise and reliable medical annotation tool. Each component plays a strategic role in
achieving this goal.</p>
        <p>• Role-Playing (‘ROLE‘): The prompt assigns the model the persona of an ‘"AI Medical Concept
Enrichment Specialist"‘. This technique is crucial for setting the context. It shifts the model from
a general conversational mode to a professional, domain-specific one, encouraging it to access
and utilize its training data related to medical science and formal writing styles.
• Zero-Shot, Instruction-Following (‘INSTRUCTIONS‘): The core of the prompt is a detailed,
algorithmic set of instructions. Rather than relying on the model to infer the task from examples
alone (few-shot), it explicitly defines the procedure: scan, identify, verify, generate, format, and
return. This converts a potentially ambiguous creative task into a more deterministic, rule-based
process, significantly increasing the reliability and consistency of the output.
• Strict Constraints and Negative Logic (‘CONSTRAINTS‘): A key to high-precision output is
defining not only what to do, but also what not to do.</p>
        <p>– Inclusion Scope: By listing categories of terms to enrich (diseases, symptoms, anatomy),
the prompt focuses the model’s attention on high-value, specific medical concepts that carry
significant diagnostic weight.
– Exclusion Scope: The negative constraints (e.g., excluding ’patient’, ’doctor’, ’left’) are vital
for preventing "over-enrichment". Without these rules, the model might define common
words, cluttering the output and diluting the importance of the truly significant medical
terms. This improves the signal-to-noise ratio of the enriched caption.
• Knowledge Grounding and Source Prioritization: To prevent model "hallucination" or
inaccurate definitions, the prompt explicitly grounds the required knowledge in authoritative
external sources. By instructing the model to prioritize definitions consistent with DermNet NZ
for dermatology and established ontologies like SNOMED CT or WikiDoc for general medicine,
we guide the model to generate factually accurate and contextually appropriate information.
The instruction to ‘"use your crawl skills"‘ leverages the model’s ability to access and synthesize
information from its vast training data, which includes these reliable web sources.
• Format Enforcement via Examples: While the prompt is primarily instruction-based, it
includes a set of clear ‘EXAMPLES‘. These serve a critical function: they demonstrate the exact
implementation of all the preceding rules, especially the strict output format of ‘Term [Definition]‘.
The examples cover diverse cases, including dermatology-specific terms, general medical terms,
and a case showing correct exclusion, leaving no ambiguity about the expected output. This
combination of explicit instructions and illustrative examples is a powerful technique for ensuring
the model adheres to the desired schema.</p>
        <p>In conclusion, the effectiveness of our knowledge enrichment pipeline is not merely due to using a
powerful LLM, but is a direct result of this carefully crafted prompt. It strategically combines role-playing,
explicit instructions, positive and negative constraints, knowledge grounding, and clear examples to
transform a general-purpose tool into a specialized, reliable, and highly effective component of our
medical VQA system.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, we used Gemini 2.5 Flash and ChatGPT-3.5 to check
grammar and sentence structure. After using these tools, we reviewed and edited the content as needed
and take full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-12">
      <title>Acknowledgments</title>
      <p>This research is funded by University of Information Technology, Vietnam National University
Ho Chi Minh City, under grant number D4-2025-04.</p>
      <p>[28] J. J. Lau, S. Gayen, D. Demner-Fushman, A. Ben Abacha, Visual question answering in radiology (VQA-RAD),
2022.
[29] X. He, Z. Cai, W. Wei, Y. Zhang, L. Mou, E. Xing, P. Xie, Towards visual question answering
on pathology images, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th
Annual Meeting of the Association for Computational Linguistics and the 11th International
Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for
Computational Linguistics, Online, 2021, pp. 708–718. URL: https://aclanthology.org/2021.acl-short.90/.
doi:10.18653/v1/2021.acl-short.90.
[30] B. Liu, L.-M. Zhan, L. Xu, L. Ma, Y. Yang, X.-M. Wu, Slake: A semantically-labeled
knowledge-enhanced dataset for medical visual question answering, 2021. URL: https://arxiv.org/abs/2102.09542.
arXiv:2102.09542.
[31] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, Y. Zhou, TransUNet: Transformers
make strong encoders for medical image segmentation, 2021. URL: https://arxiv.org/abs/2102.04306.
arXiv:2102.04306.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <article-title>Overview of the mediqa-magic task at imageclef 2025: Multimodal and generative telemedicine in dermatology</article-title>
          ,
          <source>in: CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bozorgpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sadegheih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kazerouni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Azad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Merhof</surname>
          </string-name>
          ,
          <article-title>DermoSegDiff: A boundary-aware segmentation diffusion model for skin lesion delineation</article-title>
          , in: I.
          <string-name>
            <surname>Rekik</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Adeli</surname>
            ,
            <given-names>S. H.</given-names>
          </string-name>
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Cintas</surname>
          </string-name>
          , G. Zamzmi (Eds.),
          <source>Predictive Intelligence in Medicine - 6th International Workshop</source>
          , PRIME 2023,
          <article-title>Held in Conjunction with MICCAI 2023, Vancouver</article-title>
          , BC, Canada, October 8,
          <year>2023</year>
          , Proceedings, volume
          <volume>14277</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2023</year>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>158</lpage>
          . URL: https://doi.org/10.1007/978-3-031-46005-0_13. doi:10.1007/978-3-031-46005-0_13.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          , G. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <article-title>Masked vision and language pre-training with unimodal and multimodal contrastive losses for medical visual question answering</article-title>
          , in: H.
          <string-name>
            <surname>Greenspan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Madabhushi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mousavi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Salcudean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Duncan</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Syeda-Mahmood</surname>
            ,
            <given-names>R</given-names>
          </string-name>
          . Taylor (Eds.),
          <source>Medical Image Computing and Computer Assisted Intervention - MICCAI 2023</source>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>374</fpage>
          -
          <lpage>383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Gemini Team</surname>
          </string-name>
          , R. Anil, et al.,
          <article-title>Gemini: A family of highly capable multimodal models</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2312.11805. arXiv:2312.11805.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C. H.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation</article-title>
          , in: K. Chaudhuri,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jegelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szepesvári</surname>
          </string-name>
          , G. Niu, S. Sabato (Eds.),
          <source>International Conference on Machine Learning</source>
          , ICML
          <year>2022</year>
          , 17-23 July 2022, Baltimore, Maryland, USA, volume
          <volume>162</volume>
          of Proceedings of Machine Learning Research, PMLR,
          <year>2022</year>
          , pp.
          <fpage>12888</fpage>
          -
          <lpage>12900</lpage>
          . URL: https://proceedings.mlr.press/v162/li22n.html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Nguyen-Tat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-A.</given-names>
            <surname>Vo</surname>
          </string-name>
          , P.-S. Dang,
          <article-title>Qmaxvit-unet+: A query-based maxvit-unet with edge enhancement for scribble-supervised segmentation of medical images</article-title>
          ,
          <source>Computers in Biology and Medicine</source>
          <volume>187</volume>
          (
          <year>2025</year>
          )
          <fpage>109762</fpage>
          . URL: http://dx.doi.org/10.1016/j.compbiomed.2025.109762. doi:10.1016/j.compbiomed.2025.109762.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Nguyen-Tat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Q. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-N.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Ngo</surname>
          </string-name>
          ,
          <article-title>Enhancing brain tumor segmentation in mri images: A hybrid approach using unet, attention mechanisms, and transformers</article-title>
          ,
          <source>Egyptian Informatics Journal</source>
          <volume>27</volume>
          (
          <year>2024</year>
          )
          <fpage>100528</fpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1110866524000914. doi:10.1016/j.eij.2024.100528.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Nguyen-Tat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Hung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. T.</given-names>
            <surname>Nam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Ngo</surname>
          </string-name>
          ,
          <article-title>Evaluating pre-processing and deep learning methods in medical imaging: Combined effectiveness across multiple modalities</article-title>
          ,
          <source>Alexandria Engineering Journal</source>
          <volume>119</volume>
          (
          <year>2025</year>
          )
          <fpage>558</fpage>
          -
          <lpage>586</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1110016825001176. doi:10.1016/j.aej.2025.01.090.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Datla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>Vqa-med: Overview of the medical visual question answering task at imageclef 2019</article-title>
          ,
          <source>in: Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2019</year>
          . URL: https://api.semanticscholar.org/CorpusID:198489641.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <year>2015</year>
          . URL: https://arxiv.org/abs/1409.1556. arXiv:1409.1556.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <year>2015</year>
          . URL: https://arxiv.org/abs/1512.03385. arXiv:1512.03385.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1810.04805. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] W.-w. Yim,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <article-title>Overview of the MEDIQA-M3G 2024 shared task on multilingual multimodal medical answer generation</article-title>
          , in: T.
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>A. Ben</given-names>
          </string-name>
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Roberts</surname>
          </string-name>
          , D. Bitterman (Eds.),
          <source>Proceedings of the 6th Clinical Natural Language Processing Workshop</source>
          , Association for Computational Linguistics, Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>581</fpage>
          -
          <lpage>589</lpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          .clinicalnlp-
          <volume>1</volume>
          .55/. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          .clinicalnlp-
          <volume>1</volume>
          .
          <fpage>55</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>A.</given-names> <surname>Radford</surname></string-name>,
          <string-name><given-names>J. W.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Hallacy</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Ramesh</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Goh</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Agarwal</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Sastry</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Askell</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Mishkin</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Clark</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Krueger</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>,
          <article-title>Learning transferable visual models from natural language supervision</article-title>,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2103.00020. arXiv:2103.00020.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Anthropic</surname>
          </string-name>
          , Claude 3 family, https://www.anthropic.com/news/claude-3-family,
          <year>2024</year>
          . Accessed: 2024-04-24.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>X.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Zhao</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Xie</surname></string-name>,
          <article-title>PMC-VQA: Visual instruction tuning for medical visual question answering</article-title>,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2305.10415. arXiv:2305.10415.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Siddique</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Paheding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Elkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Devabhaktuni</surname>
          </string-name>
          ,
          <article-title>U-net and its variants for medical image segmentation: A review of theory and applications</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>82031</fpage>
          -
          <lpage>82057</lpage>
          . URL: http://dx.doi.org/10.1109/ACCESS.2021.3086020. doi:10.1109/access.2021.3086020.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><given-names>Z.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>M. M. R.</given-names> <surname>Siddiquee</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Tajbakhsh</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Liang</surname></string-name>,
          <article-title>UNet++: A nested U-Net architecture for medical image segmentation</article-title>,
          <year>2018</year>
          . URL: https://arxiv.org/abs/1807.10165. arXiv:1807.10165.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name><given-names>H.</given-names> <surname>Huang</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Tong</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Iwamoto</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Han</surname></string-name>,
          <string-name><given-names>Y.-W.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wu</surname></string-name>,
          <article-title>UNet 3+: A full-scale connected UNet for medical image segmentation</article-title>,
          <year>2020</year>
          . URL: https://arxiv.org/abs/2004.08790. arXiv:2004.08790.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Hoi</surname></string-name>,
          <article-title>BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation</article-title>
          , in: ICML,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name><given-names>J.</given-names> <surname>Zhou</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>W.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Afvari</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Han</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Song</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Ji</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Gao</surname></string-name>,
          <article-title>SkinCAP: A Multi-modal Dermatology Dataset Annotated with Rich Medical Captions</article-title>,
          <year>2024</year>
          . arXiv:2405.18004.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name><given-names>O.</given-names> <surname>Ronneberger</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Fischer</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Brox</surname></string-name>,
          <article-title>U-Net: Convolutional networks for biomedical image segmentation</article-title>,
          <year>2015</year>
          . URL: https://arxiv.org/abs/1505.04597. arXiv:1505.04597.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <article-title>DermaVQA-DAS: Dermatology Assessment Schema (DAS) and datasets for closed-ended question answering and segmentation in patient-generated dermatology images</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name><given-names>W.-w.</given-names> <surname>Yim</surname></string-name>,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen-Yildiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <article-title>DermaVQA: A multilingual visual question answering dataset for dermatology</article-title>,
          in: <source>International Conference on Medical Image Computing and Computer-Assisted Intervention</source>,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>Radiology Objects in COntext (ROCO): A multimodal image dataset</article-title>,
          in: <string-name><given-names>D.</given-names> <surname>Stoyanov</surname></string-name>,
          <string-name><given-names>Z.</given-names> <surname>Taylor</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Balocco</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Sznitman</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Martel</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Maier-Hein</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Duong</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Zahnd</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Demirci</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Albarqouni</surname></string-name>,
          <string-name><given-names>S.-L.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Moriconi</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Cheplygina</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Mateus</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Trucco</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Granger</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Jannin</surname></string-name> (Eds.),
          <source>Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis</source>
          , Springer International Publishing, Cham,
          <year>2018</year>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name><given-names>S.</given-names> <surname>Subramanian</surname></string-name>,
          <string-name><given-names>L. L.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Bogin</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Mehta</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>van Zuylen</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Parasa</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Singh</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Gardner</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hajishirzi</surname></string-name>,
          <article-title>MedICaT: A Dataset of Medical Images, Captions, and Textual References</article-title>,
          in: <source>Findings of EMNLP</source>,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name><given-names>J.</given-names> <surname>Rückert</surname></string-name>,
          <string-name><given-names>A. B.</given-names> <surname>Abacha</surname></string-name>,
          <string-name><given-names>A. G. S.</given-names> <surname>de Herrera</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Bloch</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Brüngel</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Idrissi-Yaghir</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Schäfer</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Müller</surname></string-name>,
          <string-name><given-names>C. M.</given-names> <surname>Friedrich</surname></string-name>,
          <article-title>Overview of ImageCLEFmedical 2022 - caption prediction and concept detection</article-title>,
          in: <string-name><given-names>G.</given-names> <surname>Faggioli</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Ferro</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Hanbury</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Potthast</surname></string-name> (Eds.),
          <source>Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum</source>,
          Bologna, Italy, September 5th to 8th, 2022, volume
          <volume>3180</volume>
          of <source>CEUR Workshop Proceedings</source>, CEUR-WS.org,
          <year>2022</year>
          , pp.
          <fpage>1294</fpage>
          -
          <lpage>1307</lpage>
          . URL: https://ceur-ws.org/Vol-3180/paper-95.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>