<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF 2025 Working Notes, CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Anastasia at MEDIQA-MAGIC 2025: A Multi-Approach Segmentation Framework with Extensive Augmentation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tung Thanh Le</string-name>
          <email>tunglethanh0222@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tri Minh Ngo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Khoi Dinh Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Trung Hieu Dang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Trong Hoang Pham</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thien B. Nguyen-Tat</string-name>
          <email>thienntb@uit.edu.vn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Information Technology</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This study presents our approach for the ImageCLEFmedical 2025 Dermatological Segmentation task, focusing on the impact of data augmentation strategies for medical image segmentation. Our objective is to enhance segmentation accuracy through the combination of diverse augmentation techniques and post-processing refinements. We experimented with 20 augmentation techniques across three categories: geometric, photometric, and noise/artifact. Initially, we trained and evaluated various segmentation architectures, including pure CNNs, ViT-based models, and hybrid designs. Based on validation performance, we selected TransUNet with a hybrid ResNet-50 and ViT-B/16 backbone as the final model. This model was trained using datasets augmented by individual and full transformation strategies. We also experimented with MedSAM as a post-processing refinement, but it was applied only on predictions from the model trained on unaugmented data, resulting in no improvement in score. Experimental results showed that training with the full augmentation suite significantly improved segmentation outcomes, especially in challenging regions. These improvements were consistent across both validation and test datasets. Our findings demonstrate that integrating a hybrid CNN-transformer model with comprehensive augmentation creates a robust pipeline for dermatological image segmentation in clinical settings. Our method ranked first (Top-1) on the final leaderboard of the MEDIQA-MAGIC 2025 Segmentation Subtask.</p>
      </abstract>
      <kwd-group>
        <kwd>ImageCLEF 2025</kwd>
        <kwd>Dermatological Segmentation</kwd>
        <kwd>Data Augmentation</kwd>
        <kwd>Geometric Transformations</kwd>
        <kwd>Photometric Adjustments</kwd>
        <kwd>Noise Artifacts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The MEDIQA–MAGIC 2025 challenge at ImageCLEF presents complementary subtasks of
dermatological image analysis, including semantic segmentation of problem regions and closed-ended question
answering [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The DermaVQA-DAS dataset, introduced for these subtasks, comprises patient-generated
dermoscopic photographs paired with binary masks and multiple-choice annotations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Accurate delineation of skin lesions is critical for diagnosis and treatment planning but remains
challenging due to limited annotated data, high intra-class variability, and imaging artifacts such as
hair occlusion and varying illumination [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ].
      </p>
      <p>
        Early encoder–decoder architectures like U-Net and its nested variant U-Net++ exploit multiscale
feature fusion via skip connections to capture both local and global context [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Attention U-Net
further integrates trainable attention gates to focus on relevant regions within the image [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The
nnU-Net framework automates configuration and hyperparameter tuning of U-Net pipelines, achieving
state-of-the-art performance across diverse biomedical segmentation benchmarks [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ].
      </p>
      <p>
        Transformer-based hybrids such as TransUNet combine convolutional encoders with global
self-attention to improve boundary delineation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], while models like Swin UNETR leverage shifted-window
attention for efficient multiresolution feature extraction [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Foundation models adapted to medical
imaging, notably MedSAM, offer promptable zero-shot refinement that sharpens output masks with
minimal additional training [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Data augmentation remains indispensable for mitigating overfitting on small medical datasets [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In
addition to classical geometric and photometric transforms, learned augmentation policies such as
AutoAugment optimize transformation strategies via reinforcement learning [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], while RandAugment [14]
and AugMix [15] have demonstrated strong empirical performance in vision tasks.
      </p>
      <p>Recent advances include self-supervised pre-training of Swin Transformers for 3D medical image
analysis, which improves convergence and generalization [16], as well as split-attention U-Net variants
that deliver compact models without sacrificing accuracy [17].</p>
      <p>In this study, we focus on Subtask 1 (dermatological segmentation) of MEDIQA–MAGIC 2025,
proposing a robust pipeline that integrates: (a) comprehensive augmentation across geometric,
photometric, and noise/artifact families; (b) a hybrid ResNet-50–ViT-B/16 TransUNet backbone; and (c)
post-processing refinement via MedSAM.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. Dermatological Image Segmentation</title>
        <p>
          Dermatological image segmentation is a crucial but challenging task in computer-aided diagnosis and
population-scale screening for skin diseases [18, 19]. Key difficulties arise from the high intra-class
variability of lesions, ambiguous boundaries with normal skin, and imaging artifacts such as hair
occlusion or varying illumination [
          <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
          ]. Public datasets are often small, imbalanced, and costly to
annotate at the pixel level, limiting model generalization and causing overfitting when models are
deployed on data from diverse clinical settings [
          <xref ref-type="bibr" rid="ref5">18, 19, 5</xref>
          ]. These limitations have driven continuous
innovation in both dataset design and segmentation architectures for dermatology.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data Augmentation in Medical Segmentation</title>
        <p>
          Data augmentation is a central solution for improving generalizability and robustness in medical image
segmentation [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Geometric transformations (e.g., flipping, rotation, cropping) help address viewpoint
and scale variations, while photometric changes (e.g., contrast, brightness, color jitter) account for
device- and skin-related differences [
          <xref ref-type="bibr" rid="ref12 ref6">20, 12, 6</xref>
          ]. The addition of noise and artifacts (e.g., Gaussian noise,
motion blur, compression artifacts) is essential to simulate real-world imperfections. Studies have
shown that systematic, multi-operator augmentation pipelines—notably those implemented in the
Albumentations library [
          <xref ref-type="bibr" rid="ref12">20, 12</xref>
          ]—can significantly increase segmentation accuracy, even in low-data
or high-imbalance regimes. Effective augmentation is now recognized as a critical factor in nearly all
state-of-the-art solutions for medical image segmentation [
          <xref ref-type="bibr" rid="ref12 ref6">12, 6</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Deep Learning Architectures for Medical Segmentation</title>
        <p>
          Early breakthroughs in medical segmentation leveraged convolutional neural networks (CNNs), such
as U-Net and its extensions [
          <xref ref-type="bibr" rid="ref3 ref6">3, 6</xref>
          ], which were designed to capture local context and spatial structure.
However, CNNs often struggle with long-range dependencies and variable lesion shapes. The
introduction of Vision Transformers (ViT) [21] and hybrid models such as TransUNet [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] has significantly
advanced the field by combining global attention mechanisms with local feature encoding. More
recently, foundation models like MedSAM [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], pretrained on millions of medical images across modalities,
have demonstrated strong zero-shot and few-shot generalization for segmentation tasks.
        </p>
        <p>In addition, recent research by Nguyen-Tat et al. has proposed hybrid network architectures and
edge-aware attention mechanisms, exemplified by QMaxViT-UNet+ [22], as well as approaches that integrate
U-Net, attention mechanisms, and transformers for brain MRI tumor segmentation [23]. Further, these
authors have systematically evaluated the effectiveness of preprocessing and deep learning techniques
across multiple imaging modalities [24].</p>
        <p>
          Recent benchmarks and challenge results confirm that optimal performance requires both advanced
model architectures (e.g., CNN–Transformer hybrids, foundation models) and comprehensive, diverse
augmentation pipelines [
          <xref ref-type="bibr" rid="ref10 ref5">19, 5, 10</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Task and Dataset Descriptions</title>
      <sec id="sec-3-1">
        <title>3.1. Task Descriptions</title>
        <p>
          The 2nd MEDIQA–MAGIC 2025 shared task extends the 2024 multimodal dermatology
benchmark and targets automatic response generation from combined clinical narratives and dermoscopic
photographs.[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] Each encounter consists of (i) a clinical narrative context and (ii) one or more color
images of the skin lesion(s).[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] Two complementary subtasks are defined:
1. Segmentation of dermatological problem regions. Given the image(s) and clinical history,
systems must generate segmentations of the regions of interest that correspond to the described
dermatological problem.[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
2. Closed-ended question answering. For the same encounter, systems are given a dermatological
query, its accompanying images, and a closed-ended question with multiple-choice options; the
objective is to select the single correct answer.[25]
Segmentation performance is assessed with region-overlap metrics such as Intersection-over-Union
(Jaccard)[26], whereas the closed-question subtask is evaluated using accuracy and the macro-averaged
F1 score.[27]
In this study, we focus exclusively on Subtask 1—segmentation of dermatological regions of interest.
        </p>
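        <p>
          For concreteness, the sketch below shows how the two region-overlap metrics can be computed for
a pair of binary masks. The function names and the epsilon smoothing term are our own illustrative
choices, not the official evaluation code.
        </p>
        <preformat>
import numpy as np

def jaccard_index(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Intersection-over-Union (Jaccard) between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection / (union + eps))

def dice_coefficient(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient; for binary masks this equals 2*IoU / (1 + IoU)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return float(2.0 * intersection / (pred.sum() + gt.sum() + eps))
        </preformat>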
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset Information</title>
        <p>
          For Subtask 1, we use the MEDIQA–MAGIC 2025 segmentation dataset [
          <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
          ]. It includes original
skin images with binary masks indicating affected regions [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The dataset is divided into three disjoint
splits, provided by the organisers and adopted without modification. Each image is annotated by four
distinct annotators: ann0, ann1, ann2, and ann3 [28].
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Augmentation Strategy</title>
        <p>
          State-of-the-art segmentation networks require large, diverse, and balanced datasets to achieve robust
performance.[29] However, compiling pixel-level annotations for dermatology is slow, costly, and
constrained by ethical considerations. The MEDIQA–MAGIC 2025 training set provides only 2,474
dermoscopic photographs with corresponding binary lesion masks, which is significantly smaller than
typical deep learning datasets and exhibits pronounced imbalances across lesion types and anatomical
sites. To address these challenges, aggressive data augmentation is a cornerstone of our pipeline,
strategically designed along three complementary directions—geometric, photometric, and noise &amp; artifact—to
enhance model generalization, robustness, and real-world applicability.[
          <xref ref-type="bibr" rid="ref12">12, 30</xref>
          ] These directions are
chosen for the following reasons [
          <xref ref-type="bibr" rid="ref12">12, 30</xref>
          ]:
• Geometric Augmentations: Dermatological images are captured from varied angles, scales,
and orientations due to differences in imaging equipment, patient positioning, and anatomical
locations. Geometric augmentations simulate these variations by altering the spatial configuration
of images, enabling the model to learn invariance to viewpoint, scale, and orientation changes.
This is critical for ensuring the model can accurately segment lesions regardless of how the image
is framed or oriented in real-world scenarios.[
          <xref ref-type="bibr" rid="ref12">12, 30</xref>
          ]
• Photometric Augmentations: Dermoscopic images are subject to variations in lighting
conditions, camera settings, and skin tones, which can significantly affect pixel intensities and colour
distributions. Photometric augmentations introduce controlled changes in colour, brightness, and
contrast to mimic these real-world variations, ensuring the model remains robust to differences in
illumination and imaging hardware. This is particularly important for generalizing across diverse
patient populations and clinical settings.[
          <xref ref-type="bibr" rid="ref12">12, 30</xref>
          ]
• Noise &amp; Artifact Augmentations: Real-world dermoscopic images often contain imperfections
such as motion blur, sensor noise, or compression artifacts, which can degrade segmentation
performance if the model is not trained to handle them. Noise and artifact augmentations inject
realistic acquisition degradations, training the model to ignore irrelevant distortions and focus
on the underlying lesion features. This enhances the model’s resilience to suboptimal imaging
conditions encountered in clinical practice.[
          <xref ref-type="bibr" rid="ref12">12, 30</xref>
          ]
To realise these objectives, we employ the Albumentations library.[20] A total of 20 augmentation
operators are orchestrated and grouped into three semantic families: 18 geometric, 6 photometric,
and 6 noise &amp; artifact transforms (the family counts overlap because several composite pipelines
belong to more than one family).[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] This distribution prioritises geometric augmentations with 18
transforms, reflecting the dominant challenge of spatial variability in dermoscopic imaging due to diverse
capture angles, scales, and orientations—factors that critically impact lesion segmentation accuracy
in clinical settings. The higher number of geometric operators ensures comprehensive coverage of
these spatial variations, which are more prevalent and complex than photometric or noise-related
issues. In contrast, the photometric and noise &amp; artifact directions are allocated 6 transforms each, as
their variability can be effectively addressed with a smaller, targeted set of operators, supplemented
by composite pipelines that enrich sample diversity without redundancy. This strategic allocation
optimises computational efficiency while maximising the model’s ability to generalise across real-world
imaging conditions.
        </p>
        <p>• Geometric (18 transforms): These manipulate spatial configuration to build invariance against
viewpoint and scale changes: HorizontalFlip, VerticalFlip, Transpose, CenterCrop,
RandomRotate90, RandomSizedCrop, RandomSizedCrop_0.1, PadIfNeeded,
ElasticTransform, GridDistortion, OpticalDistortion, AdvancedAugmentation,
AdvancedAugmentation_2, CompositeAug, Medium, comprehensive_augmentation, crop_80_comprehensive,
and color_transform.
• Photometric (6 transforms): Colour and intensity variations mimic different
lighting conditions and camera responses: AdvancedAugmentation,
AdvancedAugmentation_2, Medium_add_non_spatial_transformations, comprehensive_augmentation,
crop_80_comprehensive, and color_transform.
• Noise &amp; Artefact (6 transforms): These inject realistic acquisition degradations that the
model must learn to ignore: AdditiveNoise, AdvancedAugmentation,
AdvancedAugmentation_2, Medium_add_non_spatial_transformations, comprehensive_augmentation, and
crop_80_comprehensive.</p>
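        <p>
          As an illustration of how such operators can be composed, the sketch below assembles a small
Albumentations pipeline with a few operators from each family. The specific operators, probabilities,
and parameter values shown are illustrative assumptions, not our exact configuration.
        </p>
        <preformat>
import albumentations as A
import numpy as np

# Placeholder inputs; in practice these are a dermoscopic photograph
# and its binary lesion mask.
image = np.zeros((512, 512, 3), dtype=np.uint8)
mask = np.zeros((512, 512), dtype=np.uint8)

# Illustrative composite pipeline (operators and probabilities are
# assumptions, not the exact configuration used in our experiments).
composite_aug = A.Compose([
    # Geometric: viewpoint and scale invariance
    A.HorizontalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.ElasticTransform(p=0.3),
    # Photometric: lighting and colour variation
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
    # Noise &amp; artifact: acquisition degradations
    A.GaussNoise(p=0.3),
    A.MotionBlur(blur_limit=5, p=0.2),
])

# Albumentations applies the same spatial transform to image and mask.
augmented = composite_aug(image=image, mask=mask)
aug_image, aug_mask = augmented["image"], augmented["mask"]
        </preformat>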
        <p>
          Composite pipelines: The transforms AdvancedAugmentation,
AdvancedAugmentation_2, Medium, Medium_add_non_spatial_transformations, CompositeAug,
comprehensive_augmentation, crop_80_comprehensive, and color_transform are higher-level pipelines
that compose multiple elementary operations, thereby substantially enriching sample diversity.
While our current pipeline relies on majority voting to aggregate the four annotator masks into
a single ground truth, we acknowledge that treating each annotation as a distinct supervision
signal—effectively using them as GT augmentations—could provide useful label diversity. To maintain
training consistency and evaluation comparability, we did not experiment with this approach in the
current study. Nevertheless, we consider it a promising direction for future exploration, particularly in
modeling annotator disagreement or uncertainty.[28]
With a robust augmentation strategy in place to address the limitations of the MEDIQA–MAGIC 2025
dataset, the next critical step is selecting an appropriate model architecture that can effectively leverage
this enriched data. The choice of model must balance computational efficiency with the ability to
capture both local and global contextual features inherent in dermoscopic images, paving the way for
the proposed TransUNet-based approach detailed in the following subsection.[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Training Pipeline</title>
        <p>
          The following steps outline our segmentation training strategy based on the MEDIQA–MAGIC 2025
dataset:
1. Majority Voting Label Aggregation: Each image is associated with four binary masks from
annotators ann0–ann3. These are aggregated into a single ground-truth mask using pixel-wise
majority voting [28, 31].
2. Data Augmentation by Keyword Category: Augment the dataset using three families of
transformations [
          <xref ref-type="bibr" rid="ref12">12, 20</xref>
          ]:
• Geometric: flipping, rotation, scaling, etc.
• Photometric: brightness, contrast, color jitter, etc.
        </p>
        <p>
• Noise &amp; Artifact: Gaussian noise, blur, synthetic hair occlusion, etc.
3. Training with Individual Augmentations: For each augmentation category, we train a
TransUNet model with a ResNet50–ViT-B_16 hybrid backbone to assess the impact of each
transformation type [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
4. Training with Combined Augmentations: We merge all augmentation categories and train
TransUNet models with different backbone configurations [32, 21]:
• ResNet50–ViT-B_16 (hybrid)
• ViT-B_16, ViT-B_32
        </p>
        <sec id="sec-4-2-1">
          <title>The best-performing model is selected based on validation performance [10].</title>
          <p>
            5. Model Refinement via MedSAM: We apply the MedSAM framework to refine predictions of
the best model. Specifically, we fine-tune four separate MedSAM models with a ViT-B backbone
on the training set, each using segmentation masks from a diferent annotator (ann0–ann3).
During inference, bounding boxes predicted by TransUNet are used as prompts, together with
the original image, to guide each MedSAM model in generating refined segmentations. [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ].
          </p>
        </sec>
        <sec id="sec-4-2-2">
          <title>The complete training pipeline is illustrated in Figure 2.</title>
        </sec>
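        <p>
          To make the refinement step concrete, the sketch below shows one way a TransUNet prediction can
be turned into a box prompt for MedSAM. The margin, the fallback for empty predictions, and the
function name are illustrative assumptions, and the MedSAM forward pass itself is omitted.
        </p>
        <preformat>
import numpy as np

def mask_to_box_prompt(pred_mask: np.ndarray, margin: int = 5) -> np.ndarray:
    """Tight bounding box (x_min, y_min, x_max, y_max) around the predicted
    foreground, padded by a small margin, for use as a box prompt."""
    h, w = pred_mask.shape
    ys, xs = np.nonzero(pred_mask)
    if len(xs) == 0:
        # Empty prediction: fall back to the whole image (an assumption).
        return np.array([0, 0, w - 1, h - 1])
    x_min = max(int(xs.min()) - margin, 0)
    y_min = max(int(ys.min()) - margin, 0)
    x_max = min(int(xs.max()) + margin, w - 1)
    y_max = min(int(ys.max()) + margin, h - 1)
    return np.array([x_min, y_min, x_max, y_max])
        </preformat>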
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <sec id="sec-5-1">
        <title>5.1. Data Preparation</title>
        <p>The MEDIQA-MAGIC 2025 dataset, comprising 2,474 training, 157 validation, and 314 test images,
is preprocessed to support the segmentation task. Images and masks are normalised to
standardise pixel intensity values[33], and multiple masks per image (from more than one annotator) are
combined using majority voting[28]. The training set is augmented using the Albumentations
library with 20 operators[20], organised into subdirectories (e.g., AdditiveNoise, HorizontalFlip,
AdvancedAugmentation) to reflect geometric, photometric, and noise &amp; artefact strategies.
Additionally, a held-out subset of 295 images is split from the training set to monitor segmentation
performance during training.</p>
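        <p>
          A minimal sketch of the pixel-wise majority vote over the annotator masks follows. It assumes
binary uint8 masks; the convention that a 2–2 tie resolves to background is our own assumption,
since the exact tie-breaking rule is not specified.
        </p>
        <preformat>
import numpy as np

def majority_vote(masks: list) -> np.ndarray:
    """Pixel-wise majority vote over binary masks (e.g., ann0..ann3).
    A pixel is foreground when more than half of the annotators mark it,
    so with four annotators a 2-2 tie resolves to background."""
    stack = np.stack([m.astype(np.uint8) for m in masks], axis=0)
    votes = stack.sum(axis=0)
    return (votes * 2 > len(masks)).astype(np.uint8)
        </preformat>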
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Experiment Configurations</title>
        <p>
          Several experiments were conducted to evaluate the efficacy of the proposed segmentation framework
for the MEDIQA-MAGIC 2025 task, focusing on the integration of various augmentation strategies
and model architectures. All experiments were performed using NVIDIA Tesla P100 GPUs to ensure
consistent computational performance. The configurations are detailed as follows:
• TransUNet ResNet50–ViT-B16 with Geometric Augmentation: The TransUNet
ResNet50–ViT-B16 model[32, 21] was specifically trained with the geometric
augmentation subset for 10 epochs. This configuration focused on enhancing the model’s invariance to
spatial variations such as rotations, flips, and crops.
• TransUNet ResNet50–ViT-B16 with Photometric Augmentation: The TransUNet
ResNet50–ViT-B16 model underwent training with the photometric augmentation subset for
20 epochs. This setup aimed to improve robustness against variations in lighting, colour, and
contrast within dermoscopic images.
• TransUNet ResNet50–ViT-B16 with Noise &amp; Artefact Augmentation: The TransUNet
ResNet50–ViT-B16 model was trained using the noise &amp; artefact augmentation subset for 20
epochs. This configuration targeted the model’s resilience to imperfections such as motion blur
and sensor noise, common in real-world medical imaging.
• TransUNet Variants with All Methods: The TransUNet variants—ResNet50–ViT-B16,
ViT-B_16, ViT-B_32, ViT-L_32, and ViT-L_16—were trained using all 20 augmentation methods for 8
epochs.[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] This experiment assessed the performance of diferent Transformer scales under a
unified augmentation strategy to identify the optimal backbone architecture.
        </p>
        <p>Across all configurations, the AdamW optimiser was employed[34] with a learning rate of 1 × 10<sup>−5</sup>,
a weight decay of 1 × 10<sup>−2</sup>, and a batch size of 8, consistent with best practices for deep learning
model training. The DiceLoss function (binary mode)[35] was utilised as the loss criterion to optimise
segmentation performance. A learning rate scheduler (ReduceLROnPlateau) was applied, reducing
the learning rate by a factor of 0.5 if the validation loss did not improve for 5 consecutive epochs.
Mixed precision training was enabled using a gradient scaler to enhance computational efficiency on
GPUs[36]. Additionally, early stopping was implemented with a patience of 10 epochs[37], halting
training if the validation loss ceased to improve, ensuring optimal model convergence. These settings
ensured a robust evaluation of the proposed methodology across diverse augmentation strategies and
architectural paradigms.</p>
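        <p>
          The sketch below reconstructs this training configuration from the stated hyperparameters. It is a
minimal sketch, assuming the DiceLoss implementation from segmentation_models_pytorch and a
standard PyTorch loop; the checkpoint path and epoch budget are illustrative assumptions.
        </p>
        <preformat>
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau
from segmentation_models_pytorch.losses import DiceLoss  # assumed source of DiceLoss

def train(model, train_loader, val_loader, device, max_epochs=100):
    criterion = DiceLoss(mode="binary")                    # binary Dice loss
    optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=1e-2)
    scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5, patience=5)
    scaler = torch.cuda.amp.GradScaler()                   # mixed precision
    best_val, patience, bad_epochs = float("inf"), 10, 0   # early stopping

    for epoch in range(max_epochs):
        model.train()
        for images, masks in train_loader:                 # batch size 8
            images, masks = images.to(device), masks.to(device)
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():
                loss = criterion(model(images), masks)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()

        # Validation loss drives both the scheduler and early stopping.
        model.eval()
        val_loss, n_batches = 0.0, 0
        with torch.no_grad():
            for images, masks in val_loader:
                images, masks = images.to(device), masks.to(device)
                val_loss += criterion(model(images), masks).item()
                n_batches += 1
        val_loss /= max(n_batches, 1)
        scheduler.step(val_loss)

        if val_loss &lt; best_val:
            best_val, bad_epochs = val_loss, 0
            torch.save(model.state_dict(), "best_model.pt")  # illustrative path
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
        </preformat>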
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Experimental Results</title>
        <p>The performance of the proposed TransUNet segmentation framework was evaluated on the
MEDIQA-MAGIC 2025 dataset, comprising 157 dermoscopic image segmentation instances. The
primary metrics utilised were the Jaccard Index (IoU)[26] and Dice Coefficient[38], which are standard
for assessing segmentation quality. The results are presented in two tables: Table 2 for validation set
performance across four augmentation strategies and Table 3 for test set performance of the best model
and its refined version.</p>
        <p>The validation results indicate that the TransUNet model with the R50-ViT-B16 backbone, trained using
all augmentation methods, achieved the highest performance, with a Jaccard Index of 0.6770 and a
Dice Coefficient of 0.8074. This highlights the effectiveness of combining a ResNet50 backbone with a
Vision Transformer (ViT-B16) and a comprehensive augmentation strategy. Among the R50-ViT-B16
configurations, the “all” augmentation approach significantly outperformed others, while the
photometric augmentation yielded the lowest scores (Jaccard: 0.4137, Dice: 0.5853), likely due to insufficient
augmentation diversity impacting model generalisation. Comparing ViT variants, models with a patch
size of 16 (e.g., ViT-B_16, ViT-L_16) generally outperformed those with a patch size of 32 (e.g.,
ViT-B_32, ViT-L_32), suggesting that finer patch granularity enhances segmentation accuracy in this context.
On the test set, the best model, TransUNet (R50-ViT-B16, All), achieved a Jaccard Index of 0.6458
and a Dice Coefficient of 0.7848, demonstrating robust generalisation to unseen data. However, the
refined version of this model, which incorporated MedSAM fine-tuning, showed a slight decrease in
performance (Jaccard: 0.6113, Dice: 0.7587). This suggests that the MedSAM fine-tuning approach may
require further optimisation to improve test set outcomes, possibly due to overfitting or misalignment
with the test distribution. These findings underscore the strength of the proposed TransUNet framework
while identifying potential areas for improvement in the refinement process.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Works</title>
      <p>
        In this paper, we presented a data augmentation pipeline[
        <xref ref-type="bibr" rid="ref12">12, 20</xref>
        ] and hybrid model strategy for medical
image segmentation in the ImageCLEFmedical 2025 challenge. We began by training multiple
segmentation models and selecting the best-performing architecture—TransUNet with a ResNet-50[32]
and ViT-B/16[21] hybrid backbone. Training with a comprehensive augmentation set significantly
improved segmentation outcomes[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. However, MedSAM post-processing refinement did not yield further gains in our
experiments, indicating that prompt-based refinement requires additional tuning before it can sharpen
lesion boundaries and varied skin textures[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Our results demonstrate the benefits of combining a powerful hybrid model architecture with
systematic augmentation. Future directions include: (a) exploring architecture variations such as Swin UNETR[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
and nnU-Net[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] under similar augmentation regimes, (b) incorporating learned augmentation
policies like AutoAugment[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] for further optimization, (c) extending our approach to high-resolution
dermoscopic images and cross-domain generalization[39], (d) evaluating the performance of other
SAM-based or edge-aware refinement methods[40], and (e) using individual annotator masks as separate
supervision signals to better model annotation variability[41].
      </p>
      <p>These enhancements aim to create a more generalizable and scalable framework for medical image
segmentation in real-world applications.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research is funded by the University of Information Technology, Vietnam National University
Ho Chi Minh City, under grant number D4-2025-04.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <sec id="sec-8-1">
        <title>The author(s) have not employed any Generative AI tools.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <article-title>Overview of the mediqa-magic task at imageclef 2025: Multimodal and generative telemedicine in dermatology</article-title>
          ,
          <source>in: CLEF 2025 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Codella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Novoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malvehy</surname>
          </string-name>
          ,
          <article-title>Dermavqa-das: Dermatology assessment schema (das) and datasets for closed-ended question answering and segmentation in patient-generated dermatology images</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          , U-net:
          <article-title>Convolutional networks for biomedical image segmentation</article-title>
          , in: N.
          <string-name>
            <surname>Navab</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hornegger</surname>
            ,
            <given-names>W. M.</given-names>
          </string-name>
          <string-name>
            <surname>Wells</surname>
            ,
            <given-names>A. F.</given-names>
          </string-name>
          <string-name>
            <surname>Frangi</surname>
          </string-name>
          (Eds.),
          <source>Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015</source>
          , Springer International Publishing, Cham,
          <year>2015</year>
          , pp.
          <fpage>234</fpage>
          -
          <lpage>241</lpage>
          . doi:10.1007/978-3-319-24574-4_28.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-C.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <article-title>Improving dermoscopic image segmentation with enhanced convolutionaldeconvolutional networks</article-title>
          ,
          <source>IEEE Journal of Biomedical and Health Informatics</source>
          <volume>23</volume>
          (
          <year>2019</year>
          )
          <fpage>519</fpage>
          -
          <lpage>526</lpage>
          . doi:10.1109/JBHI.2017.2787487.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          , L. Han,
          <string-name>
            <given-names>C.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Segment anything in medical images</article-title>
          ,
          <source>Nature Communications</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>654</fpage>
          . doi:10.1038/s41467-024-44824-z.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Rahman Siddiquee</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Tajbakhsh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Liang</surname>
          </string-name>
          , Unet++
          <article-title>: A nested u-net architecture for medical image segmentation</article-title>
          , in: D.
          <string-name>
            <surname>Stoyanov</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Taylor</surname>
            , G. Carneiro,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Syeda-Mahmood</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Martel</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Maier-Hein</surname>
            ,
            <given-names>J. M. R.</given-names>
          </string-name>
          <string-name>
            <surname>Tavares</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Bradley</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          <string-name>
            <surname>Papa</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Belagiannis</surname>
            ,
            <given-names>J. C.</given-names>
          </string-name>
          <string-name>
            <surname>Nascimento</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Conjeti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Moradi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Greenspan</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Madabhushi (Eds.),
          <article-title>Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support</article-title>
          , Springer International Publishing, Cham,
          <year>2018</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>11</lpage>
          . doi:10.1007/978-3-030-00889-5_1.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>O.</given-names>
            <surname>Oktay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schlemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Folgoc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Misawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McDonagh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Y.</given-names>
            <surname>Hammerla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kainz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Glocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Rueckert</surname>
          </string-name>
          ,
          <article-title>Attention u-net: Learning where to look for the pancreas</article-title>
          ,
          <year>2018</year>
          . URL: https://arxiv.org/abs/1804.03999. arXiv:1804.03999.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Isensee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Jaeger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A. A.</given-names>
            <surname>Kohl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. H.</given-names>
            <surname>Maier-Hein</surname>
          </string-name>
          ,
          <article-title>nnu-net: a self-configuring method for deep learning-based biomedical image segmentation</article-title>
          ,
          <source>Nature Methods</source>
          <volume>18</volume>
          (
          <year>2021</year>
          )
          <fpage>203</fpage>
          -
          <lpage>211</lpage>
          . URL: https://doi.org/10.1038/s41592-020-01008-z. doi:10.1038/s41592-020-01008-z.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Isensee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ulrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Baumgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Maier-Hein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Jäger</surname>
          </string-name>
          , nnu
          <article-title>-net revisited: A call for rigorous validation in 3d medical image segmentation</article-title>
          , in: M. G. Linguraru,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Feragen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Giannarou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Glocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lekadir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Schnabel</surname>
          </string-name>
          (Eds.),
          <source>Medical Image Computing and Computer Assisted Intervention - MICCAI 2024</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>488</fpage>
          -
          <lpage>498</lpage>
          . doi:10.1007/978-3-031-72114-4_47.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Luo</surname>
          </string-name>
          , E. Adeli,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Yuille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Transunet:
          <article-title>Transformers make strong encoders for medical image segmentation</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2102.04306. arXiv:2102.04306.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hatamizadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images</article-title>
          , in: A.
          <string-name>
            <surname>Crimi</surname>
          </string-name>
          , S. Bakas (Eds.), Brainlesion: Glioma, Multiple Sclerosis,
          <source>Stroke and Traumatic Brain Injuries</source>
          , Springer International Publishing, Cham,
          <year>2022</year>
          , pp.
          <fpage>272</fpage>
          -
          <lpage>284</lpage>
          . doi:10.1007/978-3-031-08999-2_22.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Shorten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Khoshgoftaar</surname>
          </string-name>
          ,
          <article-title>A survey on image data augmentation for deep learning</article-title>
          ,
          <source>Journal of Big Data</source>
          <volume>6</volume>
          (
          <year>2019</year>
          )
          <fpage>60</fpage>
          . URL: https://doi.org/10.1186/s40537-019-0197-0. doi:10.1186/s40537-019-0197-0.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Cubuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mané</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Autoaugment: Learning augmentation strategies from data</article-title>
          ,
          <source>in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>123</lpage>
          . doi:10.1109/CVPR.2019.00020.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>