<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IReL, IIT(BHU) at MEDIQA-MAGIC 2025: Tackling Multimodal Dermatology with CLIPSeg-Based Segmentation and BERT-Swin Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krishna Tewari</string-name>
          <email>krishnatewari.rs.cse24@itbhu.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhyudaya Verma</string-name>
          <email>abhyudaya.student.cse21@itbhu.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sukomal Pal</string-name>
          <email>spal.cse@itbhu.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology(BHU) Varanasi</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Advances in multimodal learning have the potential to significantly improve the automated analysis of dermatological images by integrating visual and textual clinical information. In this work, we present IReL, IIT(BHU)'s system developed for the MEDIQA-MAGIC 2025 challenge, which addresses two tasks: lesion segmentation and closed-domain visual question answering (CVQA). For segmentation, we propose a CLIPSeg-based framework that combines clinical images with contextual prompts formed from consumer questions and clinician responses. Using frozen CLIP encoders and a fine-tuned transformer decoder, our system produces detailed lesion masks, placing us among the top-performing teams with a Dice score of 0.741 and a Jaccard score of 0.588. These results demonstrate the effectiveness of prompt-guided vision-language models in generating clinically meaningful segmentation outputs. For the CVQA task, we integrate Bio_ClinicalBERT and a Swin Transformer to encode textual and visual inputs, respectively. While this model underperformed (accuracy 0.1731), likely due to suboptimal input alignment, it establishes a foundation for future enhancements. Our findings underscore the strength of vision-language fusion for dermatological segmentation and indicate that targeted improvements in multimodal alignment and input formatting could substantially improve CVQA performance. Overall, this work highlights the promise of multimodal architectures in advancing intelligent clinical decision support.</p>
      </abstract>
      <kwd-group>
        <kwd>Segmentation</kwd>
        <kwd>Multimodal Dermatology</kwd>
        <kwd>Visual Question Answering</kwd>
        <kwd>MEDIQA-MAGIC 2025</kwd>
        <kwd>ClipSeg</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Dermatological disorders constitute a substantial portion of the global disease burden, with skin conditions
affecting nearly one in three individuals at some point in their lifetime. Accurate diagnosis of these
conditions often depends on a combination of visual inspection and clinical context, posing a multimodal
challenge in healthcare. With the proliferation of patient-generated dermatology images and natural
language interactions via telehealth platforms, there is an urgent need for intelligent systems capable
of jointly analyzing textual and visual data.</p>
      <p>
        To address this gap, the MEDIQA-MAGIC 2025 challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] introduced a two-pronged benchmark for
multimodal dermatology. The first subtask focuses on lesion segmentation, where the objective is to
generate dense pixel-wise masks of dermatological anomalies using both images and contextual prompts.
The second subtask addresses closed-domain visual question answering (CVQA), requiring models
to select the correct answer from multiple choices given a medical image and a related clinical question.
      </p>
      <p>
        For Subtask 1, we utilize CLIPSeg [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a vision-language segmentation model that integrates
textual prompts with visual features through a frozen CLIP backbone and a transformer decoder. By
incorporating contextual cues formed by concatenating patient questions and clinician responses, our
system is able to produce clinically meaningful lesion masks. For Subtask 2, we employ a dual-encoder
architecture using Bio_ClinicalBERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for textual encoding and Swin Transformer [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for image
representation. These encoders project their respective modalities into a shared embedding space to
support answer classification.
      </p>
      <p>The rest of the paper is structured as follows. Section 2 provides a review of related work. Section 3
introduces the datasets. Section 4 elaborates on our implementation strategies. Section 5 discusses the
results obtained. Finally, Section 6 offers concluding insights and suggests directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Multimodal learning has emerged as a promising direction in dermatological AI by enabling the
integration of visual and textual data for context-aware diagnosis and segmentation. While early efforts
in medical visual question answering (VQA) focused on radiology and pathology [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], recent work has
adapted these frameworks to dermatology, addressing its unique visual and semantic challenges.
      </p>
      <p>
        In lesion segmentation, the ISIC (International Skin Imaging Collaboration) challenges [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have been
instrumental in shaping benchmarks for melanoma detection. These competitions drove the adoption
of convolutional models like U-Net [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which remains a cornerstone in medical image segmentation
due to its skip connections and encoder-decoder design that retain spatial resolution. However, such
purely visual models often fail to integrate clinical context, limiting their diagnostic interpretability.
      </p>
      <p>
        To overcome this limitation, multimodal transformers such as LXMERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and ViLT [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] have been
applied to align visual and textual inputs. These models support joint reasoning tasks like captioning and
VQA, but their general-domain pretraining constrains their effectiveness in clinical settings.
Domain-specific adaptations, such as BioViL-T [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], have improved upon this by fine-tuning with biomedical
corpora, enhancing performance in tasks like image-text retrieval and clinical reporting.
      </p>
      <p>
        Simultaneously, synthetic data generation has become vital in dermatology to address limitations in
dataset availability and privacy. Generative models like StyleGAN2 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] can synthesize realistic lesion
images, though their fidelity depends heavily on careful tuning. More recent approaches [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] condition
image generation on clinical text prompts, ensuring better semantic relevance and diagnostic utility.
      </p>
      <p>Despite recent progress, most approaches treat lesion segmentation and VQA as separate tasks,
optimizing each in isolation. While unified multimodal systems are a promising goal, addressing these
tasks independently allows for specialized architectures and task-specific tuning. In this work, we adopt
separate transformer-based models for segmentation and VQA, enabling targeted advancements in each
area while contributing to the development of robust and interpretable dermatological AI solutions.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Details</title>
      <p>
        The MEDIQA-MAGIC 2025 dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] includes thousands of high-resolution clinical dermatology
images collected in real-world settings. Each encounter contains between one and five RGB images depicting
various skin lesions under natural lighting conditions. These images are captured from different
angles and are often consumer-generated, mimicking real clinical workflows. For each image, expert
dermatologists provided pixel-wise lesion segmentation masks. Up to three masks may be available per
image, created by independent annotators from a pool of four. The masks follow a standardized file
naming convention and are stored in TIFF format. Each clinical encounter is associated with one or
more templated multiple-choice questions.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>In this section, we present the methodology and model architecture employed for both subtasks.</p>
      <sec id="sec-4-1">
        <title>4.1. Segmentation (Subtask 1)</title>
        <p>This section details our vision-language pipeline for identifying dermatological lesion regions in clinical
images. As summarised in Figure 1, the workflow combines textual prompts with image features
through a pre-trained CLIP backbone and a fine-tuned CLIPSeg decoder to yield pixel-accurate lesion
masks.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Data Pre-processing and Prompt Construction</title>
          <p>
            We used the DermaVQA corpus [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], where every clinical image is accompanied by up to three
segmentation masks, each created by a dermatologist chosen from four annotators {ann0, ann1, ann2, ann3}.
Masks follow the pattern:
          </p>
          <p>IMG_{ENCOUNTERID}_{IMAGEID}_mask_{ann#}.tiff.</p>
          <p>If a given annotator did not label an image, the corresponding file is absent. To obtain a single
ground-truth mask, we perform a pixel-wise logical OR across all available annotator masks (typically three
per image). Images are converted to RGB, resized to 352 × 352 (CLIPSeg default) and normalised
with CLIP’s ImageNet statistics. For the language stream, we concatenate the consumer question and
clinician answer:</p>
          <p>Prompt = Question ‖ Answer,
then strip HTML tags and excessive whitespace.</p>
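The mask merging and prompt cleaning described above can be sketched as follows (a minimal NumPy sketch; the function names are ours, not from the released code):

```python
import re
import numpy as np

def merge_annotator_masks(masks):
    """Pixel-wise logical OR across the available annotator masks.

    Each mask is a 2-D array where non-zero pixels mark the lesion;
    annotators who did not label the image are simply absent from the list.
    """
    merged = np.zeros_like(masks[0], dtype=bool)
    for m in masks:
        merged |= (m > 0)
    return merged.astype(np.uint8)

def build_prompt(question, answer):
    """Concatenate the consumer question and clinician answer, then strip
    HTML tags and collapse excessive whitespace."""
    text = re.sub(r"<[^>]+>", " ", f"{question} {answer}")
    return re.sub(r"\s+", " ", text).strip()
```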
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Tokenisation and CLIPSegProcessor</title>
          <p>Text prompts are tokenized using CLIP’s tokenizer, producing token IDs and an attention mask to
distinguish real tokens from padding. Simultaneously, input images are resized and normalized by the vision
processor to fit CLIP’s vision transformer requirements. The Hugging Face CLIPSegProcessor combines
these steps, outputting a dictionary with input_ids, attention_mask, and pixel_values, all
properly padded, truncated, and normalized for seamless model input.</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>4.1.3. Embeddings Extraction and Model Architecture</title>
          <p>
            We adopt the CLIPSeg framework [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], which builds on CLIP’s dual-encoder architecture. The frozen
text transformer encodes the input prompt into a fixed-length global text embedding that captures
its semantic meaning. Simultaneously, the frozen vision transformer divides the image into patches,
encoding each into feature vectors that retain spatial information.
          </p>
          <p>To enable multimodal reasoning, both embeddings are projected into a shared latent space via
learned linear layers. The global text embedding is broadcast and concatenated with each image patch
embedding, forming a multimodal token sequence. This fused representation, combining semantic and
spatial cues, is passed to a lightweight transformer decoder with about 6.5 million trainable parameters.</p>
          <p>The decoder produces a dense per-pixel logits map, indicating each pixel’s probability of belonging
to the prompt-specified target region. During training and inference, the CLIP backbone remains frozen
to retain pretrained knowledge, while only the decoder is fine-tuned, ensuring computational efficiency
and reducing overfitting risks.</p>
        </sec>
        <sec id="sec-4-1-4">
          <title>4.1.4. Decoder Fine-tuning and Training</title>
          <p>The output of the CLIPSeg decoder is a single-channel logits map with the same spatial dimensions as the
input image (i.e., 352 × 352). To convert these raw logits into interpretable probabilities, we apply a
sigmoid activation function at each pixel location:
p_i = σ(z_i) = 1 / (1 + e^(−z_i)),
where z_i is the raw logit value at pixel i, and p_i ∈ (0, 1) represents the model's confidence that pixel i
belongs to the lesion region.</p>
          <p>For supervision, we use the Binary Cross-Entropy (BCE) loss, a standard choice for binary
segmentation tasks. BCE compares the predicted probability map against the binary ground-truth mask on a
pixel-by-pixel basis. The loss is computed as:</p>
          <p>ℒ_BCE = −(1/N) Σ_{i=1}^{N} [ y_i log(p_i) + (1 − y_i) log(1 − p_i) ],
where y_i ∈ {0, 1} is the ground-truth label for pixel i, p_i is the predicted probability, and N is the total
number of pixels. This formulation penalises incorrect predictions more heavily when the model is
confident, helping to stabilise learning and accelerate convergence.</p>
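For concreteness, the per-pixel sigmoid and BCE loss can be computed as follows (a NumPy sketch of the standard formulation, not the actual PyTorch training code):

```python
import numpy as np

def sigmoid(z):
    # p = 1 / (1 + exp(-z)), applied element-wise to the logits map
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(logits, targets, eps=1e-7):
    """Binary cross-entropy averaged over all pixels.

    logits  : raw decoder outputs, shape (H, W)
    targets : binary ground-truth mask, shape (H, W)
    """
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))
```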
          <p>Optimization is performed using the AdamW optimizer and we use a fixed learning rate of 9− 4,
chosen based on initial grid search experiments. To ensure stable gradients and avoid numerical
instabilities, we apply gradient clipping with a maximum norm of 1.0. This is particularly useful when
training with mixed precision, where dynamic range limitations in float16 can cause gradients to explode
in rare cases. Training is conducted for 20 to 30 epochs, with early stopping based on the validation
Dice score.</p>
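The gradient-clipping step corresponds to PyTorch's clip_grad_norm_; a NumPy sketch of clipping by a maximum global norm of 1.0 (illustrative only):

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    """Scale a list of gradient arrays so their global L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads, total_norm
```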
        </sec>
        <sec id="sec-4-1-5">
          <title>4.1.5. Implementation Details</title>
          <p>
            The segmentation pipeline is implemented in PyTorch [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], with preprocessing and model components
integrated via Hugging Face’s transformers and datasets APIs. Training and evaluation are conducted
on a single NVIDIA GPU with CUDA acceleration. Batches of 32 examples are sampled using a PyTorch
DataLoader. Each contains a clinical dermatology image, a textual prompt, and a binary segmentation
mask formed by merging annotator masks. To maximize GPU throughput, 4–8 data loading workers
are used for parallelized I/O, aiding especially with larger batches.
          </p>
          <p>We employ the CIDAS/clipseg-rd64-refined model from the Hugging Face Hub, featuring a
pretrained CLIP backbone and a transformer-based segmentation decoder. The CLIP encoders are
frozen while only the decoder is fine-tuned, reducing trainable parameters and improving generalization
with limited dermatological data. Mixed-precision training is enabled using PyTorch AMP, storing
most activations in float16 while preserving stability through selective float32 usage. This significantly
improves training efficiency. For reproducibility, random seeds are fixed across NumPy, PyTorch, and
CUDA, and deterministic operations are enforced.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. CVQA (Subtask 2)</title>
        <p>This section details our multimodal pipeline for the CVQA task as shown in Figure 2, where the objective
is to select the correct answer from a predefined set of options given a clinical image and a corresponding
question.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Dataset Composition and Instance Formatting</title>
          <p>We use the CLEF-MAGIC 2025 dataset, where each encounter consists of one to five clinical images and a set of
templated multiple-choice questions. Each question q has K candidate answers {a_1, a_2, ..., a_K}, with
only one correct label.</p>
          <p>To formulate the inputs, we first identify all images associated with an encounter, and construct
paired sequences by concatenating the question with each answer option:</p>
          <p>Input_k = q ‖ a_k, ∀k ∈ {1, ..., K}.</p>
          <p>Each such tuple is linked to all images from the encounter. When multiple images are present, we
average their embeddings (after encoding) to form a unified visual context. This strategy preserves the
collective diagnostic content of the encounter while simplifying input dimensionality.</p>
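The option pairing and embedding averaging above can be sketched as follows (the "[SEP]" separator and the helper names are our assumptions; the embeddings stand in for Swin outputs):

```python
import numpy as np

def build_option_inputs(question, options):
    """Pair the question with each candidate answer: Input_k = q ‖ a_k."""
    return [f"{question} [SEP] {opt}" for opt in options]

def average_image_embeddings(embeddings):
    """Mean-pool per-image embedding vectors into one unified visual context vector."""
    return np.mean(np.stack(embeddings, axis=0), axis=0)
```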
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Preprocessing Pipeline</title>
          <p>All image-question-option pairs are processed via Hugging Face’s unified AutoProcessor interface,
which wraps both the tokenizer and image feature extractor for compatibility with the model
architecture.</p>
          <p>Each clinical image is converted to RGB format and resized to 224 × 224 pixels. Normalization is
applied using ImageNet mean and standard deviation statistics to match the expected input distribution
of the Swin Transformer backbone. Question-option strings are tokenized using the BERT tokenizer with
truncation and padding to a maximum sequence length of 128 tokens. The tokenizer outputs input_ids
and attention_mask, which are used to mask padding tokens during attention computation.</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Model Architecture</title>
          <p>
            Our architecture consists of a dual-stream vision-language encoder, followed by a scoring module that
computes relevance scores over candidate answers. We use Bio_ClinicalBERT [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] to encode medical
question-option strings into dense representations. The final [CLS] token embedding is extracted as the
global textual representation. Images are encoded using the Microsoft’s SWIN [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] vision transformer,
yielding patch-level embeddings which are mean-pooled to obtain a global image vector.
          </p>
          <p>Each pair of textual and image embeddings is projected via separate linear layers into a shared latent
space. These projected embeddings are concatenated and passed through a feedforward classification
head, which outputs a scalar compatibility score:</p>
          <p>s_k = FFN([v_text; v_img]).</p>
          <p>For each question, the option with the highest score is selected as the predicted answer:
ŷ = arg max_k s_k.</p>
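A minimal sketch of the scoring step (the FFN head is reduced to a single linear layer for brevity, and the weights are stand-ins for the learned parameters):

```python
import numpy as np

def score_options(text_embs, img_emb, W, b):
    """Score each question-option pair: s_k = FFN([v_text_k; v_img]).

    text_embs : list of projected text vectors, one per answer option
    img_emb   : the shared projected image vector
    Returns the scores and the argmax index (the predicted answer).
    """
    scores = [float(np.concatenate([t, img_emb]) @ W + b) for t in text_embs]
    return scores, int(np.argmax(scores))
```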
        </sec>
        <sec id="sec-4-2-4">
          <title>4.2.4. Training Objective and Optimization</title>
          <p>The model is trained as a K-way classifier using the standard cross-entropy loss:</p>
          <p>ℒ = −log( exp(s_{k*}) / Σ_{k=1}^{K} exp(s_k) ),
where k* is the index of the correct answer. Training is conducted using the AdamW optimizer with a
learning rate of 5 × 10⁻⁴, β_1 = 0.9, β_2 = 0.999, and a weight decay of 0.01.</p>
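The cross-entropy over option scores, written out as a NumPy sketch of the standard formula:

```python
import numpy as np

def option_cross_entropy(scores, correct_idx):
    """L = -log( exp(s_{k*}) / sum_k exp(s_k) ), computed stably via log-sum-exp."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max()                     # numerical stability
    log_softmax = shifted - np.log(np.exp(shifted).sum())
    return float(-log_softmax[correct_idx])
```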
          <p>Each batch contains 2 question instances, where each instance includes K fused input pairs (one
for each answer option). Training is performed for 10 epochs, with early stopping monitored on the
validation F1-score. Dropout is applied with p = 0.1 after the projection layers. Gradient clipping is
employed with a maximum norm of 1.0 to ensure training stability. Mixed-precision training is enabled using
PyTorch AMP for memory and computational efficiency.</p>
          <p>All components are implemented in PyTorch using the Hugging Face Transformers and Datasets
libraries. Data loading is parallelized using 4 workers. Visual and textual inputs are managed using
custom Dataset and DataCollator classes. The final model is checkpointed using validation-based
saving, and deterministic training is enforced via fixed random seeds.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>This section presents the experimental results for the proposed approach and the evaluation metrics used for both subtasks.</p>
      <sec id="sec-5-1">
        <title>5.1. Evaluation Metrics</title>
        <p>Performance of the segmentation task is quantitatively assessed using two commonly adopted
overlap-based similarity measures: the Dice coefficient and the Jaccard index (Intersection over Union).</p>
        <p>• Dice Coefficient: Also known as the F1 score for segmentation, this metric quantifies the overlap
between the predicted mask and the ground truth. It is calculated as:
Dice = 2 · |P ∩ G| / (|P| + |G|),
where P and G are the sets of pixels in the predicted and ground-truth masks, respectively. The
Dice score ranges from 0 (no overlap) to 1 (perfect overlap) [16].</p>
        <p>• Jaccard Index: Also referred to as Intersection over Union (IoU), this metric measures the
proportion of shared elements between the predicted and ground-truth masks relative to their
union. It is defined as:
Jaccard = |P ∩ G| / |P ∪ G|,
with the same definitions for P and G as above. Like the Dice coefficient, the Jaccard index ranges
from 0 to 1, with higher values indicating more accurate segmentations [17].</p>
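Both metrics can be implemented directly from their set definitions (a NumPy sketch over binary masks):

```python
import numpy as np

def dice_score(pred, gt):
    """Dice = 2|P ∩ G| / (|P| + |G|) over binary masks."""
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def jaccard_index(pred, gt):
    """Jaccard (IoU) = |P ∩ G| / |P ∪ G| over binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0
```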
        <p>The performance of the CVQA task is evaluated using metrics that capture overall accuracy.</p>
        <p>• Accuracy: This metric quantifies the proportion of correct predictions by measuring the overlap
between the predicted and gold answer sets for each question. For each question instance, the
intersection over maximum length (IoM) between the predicted and ground truth answer sets is
computed. The final accuracy is the average of these IoM scores across all instances. Formally,
for a set of N instances:
Accuracy = (1/N) Σ_{i=1}^{N} |Pred_i ∩ Gold_i| / max(|Pred_i|, |Gold_i|),
where Pred_i and Gold_i denote the predicted and ground truth answer sets for the i-th instance,
respectively.</p>
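The IoM accuracy described above can be sketched as:

```python
def iom_accuracy(predictions, golds):
    """Average intersection-over-maximum-length between predicted and gold answer sets."""
    scores = []
    for pred, gold in zip(predictions, golds):
        pred, gold = set(pred), set(gold)
        scores.append(len(pred & gold) / max(len(pred), len(gold)))
    return sum(scores) / len(scores)
```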
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results</title>
        <p>Overall, the high similarity scores place IReL, IIT(BHU) among the top-performing teams. This
demonstrates the strength of our approach, which leveraged fine-tuning of a segmentation decoder with
clinical context embedding, leading to both accurate and clinically meaningful segmentation results.</p>
        <p>Figure 3 shows a histogram illustrating the distribution of the Jaccard Index (Intersection over
Union, IoU) computed per image on the validation dataset. This visualization provides
insight into segmentation performance at the individual image level, revealing the variance across
samples. The distribution indicates that while a substantial number of images achieve high IoU scores
(above 0.75), reflecting strong segmentation accuracy, there also exists a long tail of lower-performing
cases. These lower IoU values may be attributed to images with complex anatomical structures,
occlusions, or limited contrast, which pose inherent challenges to the model. By analyzing this distribution,
we demonstrate both the robustness of our segmentation method on a majority of cases and identify
avenues for potential improvement in challenging scenarios.</p>
        <p>Table 2 reports the CVQA performance of different participating teams, evaluated using the accuracy
metric. This metric captures the model’s ability to correctly select the appropriate answer from multiple
options based on an input image and corresponding clinical question, reflecting joint visual-linguistic
reasoning performance.</p>
        <p>The top-performing team, Hoangwithhisfriends, achieved accuracy scores above 0.74 across multiple
submissions, indicating a robust pipeline capable of extracting and reasoning over fine-grained clinical
and visual cues. DS@GT MEDIQA-MAGIC and Kasukabe Defense Group followed with several strong
runs, with accuracy levels ranging from 0.66 to 0.71 in their best submissions.</p>
        <p>In contrast, our team, IReL, IIT(BHU), attained an accuracy of 0.1731, placing significantly lower
than other submissions. A potential reason for this performance gap lies in the preprocessing phase,
specifically the construction of question-image-option triplets. Our pipeline may not have aligned
the multiple-choice options with the image-question pairs effectively, impacting the model's ability to
learn discriminative patterns. Additionally, errors or misalignments in candidate option formatting
could have weakened training supervision. Post-submission work will include addressing the identified
limitations by refining the alignment of image-question-option triplets and improving the formatting
consistency of candidate options, with the goal of enhancing training supervision and overall model
performance.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Error Analysis and Limitations</title>
        <p>QID Parent-Level Aggregation: Our current implementation evaluates each QID independently,
without grouping or aggregating predictions for questions that belong to the same QID parent. Given
that several clinical questions in the dataset are semantically related variants, this lack of aggregation
could obscure meaningful patterns or degrade accuracy. Future versions of the model will introduce:
• Aggregation of predictions across sibling QIDs under a common parent.
• Majority-vote or consensus fusion strategies for unified parent-level responses.
• Hierarchical loss functions to optimize predictions jointly at child and parent levels.</p>
        <p>Encounter ID Alignment and Formatting Consistency: We also investigated potential issues
with data alignment. Although our evaluation script confirmed matching encounter_id sets between
ground truth and predictions, we suspect that inconsistencies in how input triplets (question, option,
image) were constructed may have contributed to low performance. For example: Candidate options may
not have been correctly mapped to corresponding images. To mitigate these issues, we have implemented
stricter ID alignment checks and refined our preprocessing to enforce consistent formatting. These
refinements are expected to enhance supervision quality and improve overall model performance in
future work.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>This paper presented our approach for the MEDIQA-MAGIC 2025 challenge, focusing on lesion
segmentation and visual question answering. We developed a vision-language segmentation model, aligning
with clinical interpretation by integrating textual and visual data, which improved lesion identification.
Although the question answering module underperformed, it highlighted challenges in multimodal
alignment, guiding future work on better fusion and domain-specific prompt design. Our findings
demonstrate the promise of multimodal learning to enhance automated dermatological analysis. By
advancing model design and clinical adaptation, we aim to support diagnostic accuracy and hope to
inspire further research in this evolving field.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly in order to: check grammar and
spelling, and paraphrase and reword. After using these tools, the authors reviewed and edited the
content as needed and take full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <article-title>Overview of the mediqa-magic task at imageclef 2025: Multimodal and generative telemedicine in dermatology</article-title>
          ,
          <source>in: CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Lüddecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ecker</surname>
          </string-name>
          ,
          <article-title>Image segmentation using text and image prompts</article-title>
          ,
          <source>arXiv preprint arXiv:2112.10003</source>
          (
          <year>2021</year>
          ). URL: https://arxiv.org/pdf/2112.10003.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Alsentzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Boag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. H.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McDermott</surname>
          </string-name>
          ,
          <article-title>Publicly available clinical BERT embeddings</article-title>
          ,
          <source>arXiv preprint arXiv:1904.03323</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          (ICCV),
          <year>2021</year>
          , pp.
          <fpage>10012</fpage>
          -
          <lpage>10022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shivade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <article-title>VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019</article-title>
          , in: CLEF (Working Notes),
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.-C.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Summers</surname>
          </string-name>
          ,
          <article-title>Learning to read chest x-rays: Recurrent neural cascade model for automated image annotation</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <source>IEEE Computer Society</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2497</fpage>
          -
          <lpage>2506</lpage>
          . URL: https://doi.org/10.1109/CVPR.2016.274. doi:10.1109/CVPR.2016.274.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N. C. F.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rotemberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tschandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Celebi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Dusza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Gutman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Helba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kalloo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Liopyris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Marchetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kittler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halpern</surname>
          </string-name>
          ,
          <article-title>Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the International Skin Imaging Collaboration (ISIC)</article-title>
          , arXiv preprint arXiv:1902.03368 (
          <year>2019</year>
          ). URL: https://arxiv.org/abs/1902.03368.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          ,
          <article-title>U-Net: Convolutional networks for biomedical image segmentation</article-title>
          ,
          <source>in: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI)</source>
          , Springer,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <article-title>Lxmert: Learning cross-modality encoder representations from transformers</article-title>
          ,
          <source>in: EMNLP-IJCNLP</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Son</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>ViLT: Vision-and-language transformer without convolution or region supervision</article-title>
          ,
          <source>in: Proceedings of the 38th International Conference on Machine Learning (ICML)</source>
          , volume
          <volume>139</volume>
          ,
          PMLR
          ,
          <year>2021</year>
          , pp.
          <fpage>5583</fpage>
          -
          <lpage>5594</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Boecking</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bannur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schwaighofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hyland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wetscherek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Alvarez-Valle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Oktay</surname>
          </string-name>
          ,
          <article-title>Making the most of text semantics to improve biomedical vision-language processing</article-title>
          ,
          <source>in: Computer Vision - ECCV 2022</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          . doi:10.1007/978-3-031-20059-5_1.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Karras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Laine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Aila</surname>
          </string-name>
          ,
          <article-title>Analyzing and improving the image quality of stylegan</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8110</fpage>
          -
          <lpage>8119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Text2skin: A conditional generative model for dermatological image synthesis from clinical text</article-title>
          ,
          <source>IEEE Transactions on Medical Imaging</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <article-title>DermaVQA-DAS: Dermatology assessment schema (DAS) and datasets for closed-ended question answering and segmentation in patient-generated dermatology images</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tejani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chilamkurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
          ,
          <article-title>PyTorch: An imperative style, high-performance deep learning library</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>