<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Diagnosis with Non-Dermoscopic Images</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Chiara Bellatrecci</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Daniele Zama</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Arianna Dondi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Luca Pieranton</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Laura Andreozzi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iria Neri</string-name>
          <email>iria.neri@aosp.bo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Marcello Lanari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Andrea Borghesi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Roberta Calegari</string-name>
          <email>roberta.calegari@unibo.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>AI Fairness, AI Ethics, Skin Disease Prediction</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IRCCS Azienda Ospedaliero-Universitaria di Bologna</institution>
          ,
          <addr-line>Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bologna</institution>
          ,
          <addr-line>Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>AI-based diagnosis of skin diseases holds considerable promise for increasing healthcare accessibility; however, its effectiveness is currently limited by several challenges, including fairness. This study analyzes a real-world dataset collected from an Italian hospital, characterized by limited data availability, leading to poor diversity and representation, particularly evident in the scarcity of data for certain diseases and darker skin tones. Such limitations result in substantial classification biases. Additionally, the dataset includes non-dermoscopic, consumer-grade images that suffer from quality issues like inconsistent lighting and blurriness, complicating the training of fair and efficient AI models. Conventional strategies to mitigate these problems, such as synthesizing images for underrepresented groups, are hindered by the difficulty of accurately identifying skin tones from poor-quality images. Our research introduces a novel pipeline designed to enhance both the accuracy and fairness of skin disease diagnosis by addressing the challenges posed by real-world data. The proposed solution involves a two-stage approach: 1) data pre-processing and augmentation to obtain images that more accurately represent darker skin tones, generated through a state-of-the-art diffusion model; and 2) disease classification employing deep learning models. This methodology addresses data scarcity and improves fairness, with thorough validation on real-world data showing enhanced reliability and fairness in predictions across various skin diseases.</p>
      </abstract>
      <kwd-group>
        <kwd>AI Fairness</kwd>
        <kwd>AI Ethics</kwd>
        <kwd>Skin Disease Prediction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Related</title>
    </sec>
    <sec id="sec-2">
      <title>Works</title>
      <p>The application of Artificial Intelligence (AI) in dermatological disease prediction offers significant
advancements in diagnostic processes, facilitating rapid and efficient disease identification. However,
despite the promising capabilities of AI, issues such as bias and discrimination present substantial
challenges, particularly when these systems are applied across diverse demographic groups. Significant
efforts have been made to mitigate bias in dermatological AI applications without compromising the
privacy or integrity of demographic data. For instance, the study by [<xref ref-type="bibr" rid="ref1">1</xref>] introduces a method to ensure
fairness by enhancing feature selection during the model training phase, purposely omitting sensitive
demographic attributes. This technique relies on sophisticated feature entanglement strategies to focus
solely on disease-relevant features, minimizing biases associated with non-disease attributes like skin
tone. Moreover, the introduction of PatchAlign, as discussed in [<xref ref-type="bibr" rid="ref2">2</xref>], marks a notable advancement in
aligning skin condition image patches with corresponding clinical descriptions. Its Masked Graph
Optimal Transport (MGOT) algorithm effectively reduces noise and improves diagnostic accuracy and
fairness across various skin tones by focusing on disease-relevant image regions. The work of [<xref ref-type="bibr" rid="ref3">3</xref>]
presents EDGEMIXUP, a preprocessing technique that alters image data to diminish bias by manipulating
colour saturation and integrating edge detection outputs. This method has shown efficacy in decreasing
the performance disparity between different skin tones while maintaining overall diagnostic accuracy.
Similarly, the FairSkin framework introduced in [<xref ref-type="bibr" rid="ref4">4</xref>] leverages diffusion models to generate synthetic
medical images that represent various skin tones equitably. Lastly, [<xref ref-type="bibr" rid="ref5">5</xref>] and [<xref ref-type="bibr" rid="ref6">6</xref>] propose innovative
solutions to enhance fairness through structural model adjustments. The FairQuantize methodology
employs weight quantization to adjust model performance across different demographics, and the
channel pruning approach identifies and reduces bias by pruning channels that disproportionately
affect specific demographic groups.</p>
      <p>While the related works present innovative solutions for addressing bias and achieving acceptable
accuracy in AI-based diagnostics for skin diseases, these solutions are still largely explorative and
preliminary, rather than robust solutions to be applied in real-world scenarios. When applied to
real-world scenarios, particularly employing non-dermoscopic images, they often yield unsatisfactory results
[<xref ref-type="bibr" rid="ref7">7</xref>]. When used in our specific scenario, the existing techniques still pose significant challenges that
frequently lead to suboptimal outcomes if applied in isolation [<xref ref-type="bibr" rid="ref8">8</xref>]. It is worth
emphasizing that the dataset used in our study introduces several unique challenges that must be
responsibly addressed. The main challenges are related to (1) inherent dataset features (including its
characteristics and variability), and (2) the skewness of the available data,
which significantly over-represents certain populations, thus inducing unfairness in the classification
process (more details follow in the data description, Sect. 2). Failing to meticulously study and address
these issues within the development pipeline could lead to misdiagnoses, which in turn may exacerbate
existing healthcare inequalities and result in adverse outcomes for affected patients. Such oversight
highlights the critical need for rigorous evaluation and refinement of AI diagnostic tools to prevent
potential harm and ensure their reliability and fairness across all populations.</p>
      <p>Our work builds upon existing state-of-the-art foundational efforts, with the goal of addressing
additional limitations in real-world, highly imbalanced datasets. We explicitly consider both
classification performance and fairness metrics in our analysis. There are a few methods in the literature that
aim at improving the fairness of non-dermoscopic image disease classification through the refinement
of sophisticated Deep Learning (DL) models ([9, <xref ref-type="bibr" rid="ref1 ref5 ref6">6, 5, 1</xref>]) – our approach is orthogonal, as we do not
focus on the classification model itself but rather propose a pipeline for image data pre-processing and
data augmentation that can complement any existing DL model for classification of skin diseases. In
particular, our pre-processing technique employs the Individual Typology Angle (ITA) metric along
with a novel thresholding method based on a Gaussian Mixture Model to accurately measure the skin
tone depicted in each image. For data augmentation, we propose a novel combination of stable diffusion
with DreamBooth to address the challenge of data scarcity, which is particularly acute for darker skin
tones. To the best of our knowledge, this is the first work to consider using DreamBooth for generating
skin disease images for different skin shades. Our pre-processing method can be affected by issues
such as poor lighting and image blurriness, which may distort the perceived skin tone. To counteract
these problems, we carefully hand-pick the images used for training DreamBooth, ensuring that they
represent the skin tones targeted for augmentation. This meticulous selection process is especially
crucial as only three out of the nine diseases catalogued in our dataset have examples of ’dark’ and
’brown’ skin, necessitating precise and representative training data to enhance model fairness and
accuracy. The final step is the training of DL models for skin disease classification using pre-processed
and augmented data. In the current study, we opted for two of the most efficient models currently
available, namely the Swin Transformer (ST) and the Convolutional Neural Network (CNN); potentially,
other DL approaches could be plugged in, according to the available resources and desired outcomes.
The overall pipeline is illustrated in Fig. 1 and consists of the previously discussed preprocessing steps,
plus the comparison of enhanced results via data augmentation. Please note that the proposed pipeline
requires co-design and co-creation phases (especially in the selection phase during the pre-processing),
during which stakeholders (in this case, doctors) shall be involved to assist in the selection and validation
processes.</p>
      <p>The paper is organized as follows: Sec. 2 introduces the use case and the data that we targeted; Sec. 3
describes the data-preprocessing technique and Sec. 4 explains the data augmentation procedure; then,
Sec. 5 shows the results of the whole pipeline (in terms of classification accuracy and fairness) once the
classification models are inserted; finally, Sec. 6 concludes the paper.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Use Case and Dataset Description</title>
      <p>The dataset consists of approximately 8,000 images of 273 pediatric patients at Sant’Orsola hospital
in Bologna<sup>1</sup>, representing nine possible skin diseases: drug-induced iatrogenic exanthema (DII ex.),
maculopapular exanthema (MP ex.), morbilliform exanthema (MF ex.), polymorphous exanthema (PM
ex.), viral exanthema (V ex.), urticaria, pediculosis, scabies, and chickenpox. The images were captured
using consumer-grade cameras by the hospital’s doctors, meaning they are non-dermoscopic. The
dataset used in this study exhibits several critical characteristics that complicate the classification of
skin diseases, primarily due to its inherent properties and the conditions under which the images were
collected. Many of the images suffer from suboptimal lighting, causing skin tones to appear darker than
they are, which not only complicates disease classification but also significantly affects the accuracy
of skin tone assessments. This issue of misclassification is exacerbated by the high variability in the
images’ quality and size, as they were taken with different consumer-grade cameras. Such variability
necessitates a comprehensive standardization process to make the dataset compatible for processing by
neural networks. Additionally, the focus of the images is inconsistent—ranging from full-body shots to
close-ups of specific affected areas. This variability presents further challenges in accurately identifying
and analyzing disease-specific skin regions. Compounding these issues, some images exhibit blurriness,
which diminishes the clarity and usefulness of the data for disease diagnosis. The dataset is further
characterized by a notable scarcity of data, with only 273 patients represented, reducing the variability
essential for a robust medical analysis. This limited data is particularly problematic for certain diseases,
where only a few examples are available, skewing the class distribution and complicating the training of
a reliable and generalizable model. These issues, coupled with the fact that the images were captured by
medical professionals using consumer-grade cameras and are clinical rather than dermoscopic, introduce
additional challenges. The photographs are prone to problems such as inconsistent lighting, blurriness,
suboptimal angles, and other artefacts that negatively impact the quality and reliability of the data.
These factors must be carefully managed to develop effective and accurate AI-based diagnostic tools.
Moreover, the dataset presents significant challenges related to the representation of skin tones and
disease classes. Predominantly, the dataset contains images of patients with lighter skin tones, which
introduces a bias that complicates classification for less-represented skin tone categories. Additionally,
some diseases are overrepresented in the dataset, leading to a class imbalance where the network
is better at classifying certain illnesses over others. Although addressing class imbalance is not the
primary focus of this work, it remains a critical aspect that influences the overall model performance.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Data preprocessing</title>
      <p>The data preprocessing pipeline aims to standardize the dataset by generating uniformly sized image
crops. The objective is to identify and extract regions of the images containing visible skin disease,
using the binary mask associated with each image. The process follows a sliding-window approach and
consists of several steps. Initially, the algorithm starts at the top-left corner and extracts a fixed-size
patch measuring 256×256 pixels. Next, for each patch, a binary mask is used to calculate the disease
coverage, defined as the ratio of positive labels (1) to negative labels (0) within the mask. This metric
evaluates the presence of the skin disease based on contrast within the patch; patches are retained if the
disease coverage exceeds a predefined threshold, indicating suficient contrast, and discarded otherwise.
The patch extraction process is repeated as the sliding window moves across the image in set steps,
generating overlapping patches. To reduce redundancy, a non-maxima suppression procedure discards
patches with lower disease coverage when overlap exceeds a threshold. Finally, patches exhibiting low
contrast, such as those caused by poor illumination or blurriness, are removed to improve the overall
quality of the dataset. This step ensures that only well-defined and informative regions are retained
for further analysis. As expected, the preprocessing step reveals an inherent imbalance in the dataset
across the diferent disease classes. In particular, certain diseases are underrepresented, with fewer than
ten thousand examples available after preprocessing. This imbalance is anticipated to impact model
performance, particularly for less-represented diseases.</p>
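      <p>For concreteness, the sketch below illustrates the sliding-window extraction and non-maxima suppression just described; the stride, coverage threshold, and overlap threshold are illustrative assumptions (their exact values are not reported here), and the final low-contrast filter is omitted.</p>
      <preformat>
import numpy as np

def iou(a, b, p):
    """Intersection-over-union of two p-by-p squares at top-left corners a, b."""
    (ya, xa), (yb, xb) = a, b
    ih = max(0, min(ya, yb) + p - max(ya, yb))
    iw = max(0, min(xa, xb) + p - max(xa, xb))
    inter = ih * iw
    return inter / (2 * p * p - inter)

def extract_patches(image, mask, patch=256, stride=128,
                    coverage_thr=0.05, overlap_thr=0.5):
    """Sliding-window crop extraction guided by a binary disease mask.

    Coverage is computed here as the fraction of positive (1) pixels in the
    window, which is monotonically related to the positive-to-negative ratio
    used in the text. Threshold values are hypothetical.
    """
    h, w = mask.shape
    candidates = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            coverage = mask[y:y + patch, x:x + patch].mean()
            if coverage >= coverage_thr:
                candidates.append((coverage, (y, x)))
    # Non-maxima suppression: on heavy overlap, keep the higher-coverage patch.
    candidates.sort(key=lambda c: c[0], reverse=True)
    kept = []
    for cov, pos in candidates:
        if all(iou(pos, k, patch) &lt; overlap_thr for _, k in kept):
            kept.append((cov, pos))
    return [image[y:y + patch, x:x + patch] for _, (y, x) in kept]
      </preformat>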
      <sec id="sec-4-1">
        <title>3.1. Skin Tone Detection via ITA</title>
        <p>To accurately measure fairness across different skin tones, it is essential to correctly classify each image
according to the skin tone it represents. This approach then allows for precise evaluations for each skin
tone, identifying any performance differences, such as the presence of bias. Skin tone classification is
commonly performed using the Individual Typology Angle (ITA), a metric first introduced by Chardon et
al. in 1991 [<xref ref-type="bibr" rid="ref10">10</xref>], and widely adopted in subsequent studies for its simplicity and effectiveness [<xref ref-type="bibr" rid="ref11">11</xref>, <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>].
While this method has proven effective in controlled environments, such as dermoscopic datasets, it
assumes uniform illumination and does not account for variations introduced by pathological changes
in the skin or external artefacts. We propose a modified ITA computation method tailored to our
dataset, which includes images of skin conditions captured under non-standardized conditions with
consumer-grade cameras. To address challenges such as altered pigmentation in the affected skin,
inconsistent illumination, and shadows, we exclude disease-affected regions from ITA computation
using segmentation maps, ensuring only unaffected skin is analyzed.</p>
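        <p>For reference, the ITA of Chardon et al. [<xref ref-type="bibr" rid="ref10">10</xref>] is defined in CIELAB space as ITA = arctan((L* − 50) / b*) · 180/π. A minimal sketch of the masked computation follows, assuming scikit-image for the colour conversion; the function and variable names are ours.</p>
        <preformat>
import numpy as np
from skimage import color  # assumed dependency for RGB -> CIELAB conversion

def ita_degrees(rgb_crop, lesion_mask):
    """Mean ITA = arctan((L* - 50) / b*) * 180 / pi over non-lesion pixels.

    lesion_mask is True where the disease is visible; those pixels are
    excluded, mirroring the bitwise-AND masking step described above.
    arctan2 is used for numerical safety; for typical skin, b* > 0 and it
    coincides with the plain arctan of the original formula.
    """
    lab = color.rgb2lab(rgb_crop)      # L* in [0, 100], b* signed
    healthy = ~lesion_mask
    L = lab[..., 0][healthy]
    b = lab[..., 2][healthy]
    return float(np.mean(np.degrees(np.arctan2(L - 50.0, b))))
        </preformat>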
        <p>Unlike prior works relying on fixed thresholds from dermoscopic datasets [<xref ref-type="bibr" rid="ref12">12</xref><xref ref-type="bibr" rid="ref13 ref14">, 13, 14</xref>], we classify ITA
values into skin tone categories using a Gaussian Mixture Model (GMM), which better handles dataset
variability. This refined method provides a more accurate representation of skin tone, enabling a fairer
evaluation of classification performance. The computation of the ITA must account for the fact that
skin affected by disease often appears darker and reddish compared to healthy skin. To ensure reliable
ITA values that represent the baseline skin tone, it is key to exclude disease-affected regions from the
calculation. This was achieved by applying a bitwise AND operation between the original image crop
and its corresponding segmentation mask, replacing disease-affected regions with black pixels. The ITA
value was then computed exclusively for the non-black pixels in the crop. The resulting distribution of
ITA values closely resembles a Gaussian distribution with a longer tail extending towards lower values.
Following the computation of ITA values, ranges are required to classify skin tone according to the
Fitzpatrick scale [<xref ref-type="bibr" rid="ref11">11</xref>], which categorizes skin into six types. Various thresholding schemes have been
proposed to map ITA values to Fitzpatrick skin types [<xref ref-type="bibr" rid="ref12">12</xref><xref ref-type="bibr" rid="ref13 ref14">, 13, 14</xref>]. However, these ranges were primarily
designed for dermoscopic datasets, devoid of variability caused by illumination, angulation, or other
artefacts. Given the non-dermoscopic nature of our dataset, these thresholds were deemed unsuitable.
Instead, we assumed that images with similar skin tones exhibit similar ITA values within a reasonable
range of variation. To classify the ITA values, we fitted the distribution using a Gaussian Mixture Model
with six components, corresponding to the six skin tone categories in the Fitzpatrick scale. Each ITA
value was assigned to the Gaussian component that best represented its value. The resulting skin tone
labels were categorized as dark, brown, tan, intermediate, light, and very light. Examples of the automatic
skin tone classification are presented in Figure 2. While the ITA value is generally robust, shadows
and poor illumination can lower the ITA value, resulting in a darker assigned skin tone. Nonetheless,
darker images—whether due to actual skin tone or suboptimal lighting—were correctly assigned a
lower ITA value, whereas lighter images were assigned higher ITA values. The distribution of skin
tone labels across the dataset shows that the dark and brown skin tone categories are underrepresented,
highlighting an imbalance in skin tone distribution within the dataset. Despite the robustness of the
ITA calculation, this labelling process is not entirely accurate. Poor illumination or other artefacts cause
the computed ITA value to deviate from the expected value for the true skin tone for a non-negligible
number of images. Future work could address this limitation by incorporating advanced correction
techniques for artefacts such as shadows and uneven lighting.</p>
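        <p>A sketch of this GMM-based labelling, assuming scikit-learn; the component count and the category names follow the text above.</p>
        <preformat>
import numpy as np
from sklearn.mixture import GaussianMixture

TONES = ["dark", "brown", "tan", "intermediate", "light", "very light"]

def label_skin_tones(ita_values):
    """Fit a six-component GMM to the ITA distribution and label each image.

    Components are sorted by their mean ITA so that lower-ITA components
    map to darker categories of the Fitzpatrick-style scale used here.
    """
    x = np.asarray(ita_values, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=6, random_state=0).fit(x)
    rank = np.argsort(np.argsort(gmm.means_.ravel()))  # component -> rank
    return [TONES[rank[c]] for c in gmm.predict(x)]
        </preformat>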
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Synthetic Generation of Skin Disease Images</title>
      <p>In this section, we describe the process used to generate synthetic images. We first explain how the
DreamBooth model has been tailored to the specific use case and its extreme imbalance. Then, we
introduce three approaches for incorporating the synthetic images into training sets to be used for the
downstream task of classifying the diseases via a DL model (described in Sec. 5).</p>
      <sec id="sec-5-1">
        <title>4.1. Image Generation via the Combination of DreamBooth and Stable Diffusion</title>
        <p>The dataset used in this study contains a limited number of examples of skin diseases affecting individuals
with black skin, with at most 4 or 5 individuals with dark skin for each disease. The preprocessing
pipeline described in Section 3 generates a large number of image crops from photographs of the same
individual. However, using all these crops to train an image generation model would be redundant, as
the crops originating from the same individual are highly similar to one another. Consequently, it is
sufficient to select only a few representative crops (3 or 4) per individual with dark skin and construct a
small, curated dataset comprising multiple individuals with dark skin exhibiting the specific disease
of interest. This curated dataset can then be used to train an image generation model. One model
well-suited for training with such a limited number of examples is DreamBooth, introduced by [15]. It
is a fine-tuning technique for generative models, like diffusion-based ones, enabling the creation of
high-quality, subject-specific images from just a few samples. This approach not only personalizes the
model but also maintains its capacity to produce diverse and photorealistic outputs, making it ideal
for scenarios constrained by data scarcity. In our work, DreamBooth was employed to fine-tune a
pre-trained Stable Diffusion model, which was pre-trained on images of size 512×512. The fine-tuning
process was divided into the following stages:
1. Exploration of the Dataset – A manual inspection of the dataset was conducted to identify the
skin diseases for which images of ’dark’ or ’brown’ skin types were available. This exploration
revealed that only three out of the nine diseases—maculopapular exanthema, viral exanthema,
and scabies—contained images of individuals with ’dark’ and ’brown’ skin. Consequently, image
generation was applied only to these three diseases.
2. Construction of Mini-Datasets – Mini-datasets were manually constructed for each of the three
diseases, separately for ’brown’ and ’dark’ skin types. This process resulted in six datasets (two
for each disease: one for ’brown’ skin and one for ’dark’ skin), with each dataset containing
between 14 and 29 images.
3. Fine-Tuning with DreamBooth – For each of the six mini-datasets, the Stable Diffusion model was
fine-tuned using the DreamBooth technique. A grid search was conducted to identify optimal
hyperparameter configurations, exploring the following parameters:
• Learning rate: Values of 5e-7, 2e-6, 5e-6, and 1e-5 were tested, to ensure adequate exploration
of the parameter space, as the learning rate is a critical factor for convergence.
• Maximum training steps: For mini-datasets with fewer than 15 images, values of 1000, 2000,
3000, and 4000 steps were tested. For mini-datasets with more than 15 images, values of
2000, 3000, 4000, and 5000 steps were tested. This choice was guided by a commonly applied
rule of thumb in DreamBooth, which recommends fine-tuning with at least 100 training
steps per image.
• Instance prompt: The instance prompt in DreamBooth plays a key role in both the training
and image generation phases. During training, a unique identifier (e.g., &lt;unique_ID&gt;) is
included in the prompt alongside descriptive context (e.g., ”human skin” or ”a person with a
skin condition”) to associate the fine-tuned model with the specific features of the training
images. This enables the model to learn how to reproduce those features while maintaining
its broader generative capabilities. During image generation, the instance prompt is used to
guide the model in synthesizing new images that reflect the characteristics of the fine-tuned
training data. By combining or modifying the instance prompt with additional textual
descriptions, keeping the &lt;unique_ID&gt; in the text, it is possible to control the specific
details of the generated images, ensuring alignment with the desired output while retaining
diversity and realism. In our case, both the prompts ”&lt;unique_ID&gt;” and ”&lt;unique_ID&gt;
human skin” were evaluated. Including ’human skin’ in the prompt was hypothesized to
provide context and aid in accurately reproducing skin texture.</p>
        <p>A batch size of 1 was selected, as experiments revealed that smaller batch sizes promoted greater
diversity when other hyperparameters were held constant.
4. Model Selection – For each of the six mini-datasets, the fine-tuned models with the most promising
hyperparameter configurations were selected based on an empirical evaluation of the generated
images. The evaluation prioritized diversity, accuracy, faithful representation of skin texture and
colour, and fidelity to the real images in the mini-dataset.</p>
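        <p>For illustration, the grid explored in stage 3 can be enumerated as follows; the dictionary keys mirror the hyperparameter names of the Hugging Face diffusers DreamBooth training script, which we take here as a representative entry point (an assumption, not a prescription).</p>
        <preformat>
from itertools import product

def dreambooth_grid(num_images):
    """Hyperparameter grid for one mini-dataset, per the rules above."""
    lrs = [5e-7, 2e-6, 5e-6, 1e-5]
    steps = ([1000, 2000, 3000, 4000] if num_images &lt; 15
             else [2000, 3000, 4000, 5000])
    prompts = ["&lt;unique_ID&gt;", "&lt;unique_ID&gt; human skin"]
    return [dict(learning_rate=lr, max_train_steps=s,
                 instance_prompt=p, train_batch_size=1)
            for lr, s, p in product(lrs, steps, prompts)]
        </preformat>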
        <p>In Figure 3, we can see a sample of the results obtained through the synthetic generation of images, in
particular for images of the classes ’brown’ and ’dark’. We report the results for the three target diseases
(scabies, viral exanthema, and maculopapular exanthema); for each disease, we show some real images
and some generated ones. The synthetic images are extremely realistic, and the skin tone matches the
desired one.</p>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Data Augmentation for Rare-colour Skin Images</title>
        <p>After determining the required number of synthetic images for each of the three diseases and the two
skin colours (”dark” and ”brown”), we distributed this total among the fine-tuned models designated
for each disease and skin colour. This approach ensured that the synthetic images were generated by
models fine-tuned with various hyperparameter combinations, enhancing the diversity of the dataset.
Models fine-tuned with different hyperparameters typically produce images with distinct characteristics,
reflecting the variations in the mini-datasets used for training. Furthermore, to increase the diversity of
the synthetic images, we frequently changed the generation seed.</p>
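        <p>As an illustration, sampling from one of the six fine-tuned checkpoints might look as follows; this is a sketch based on the diffusers StableDiffusionPipeline API, with a hypothetical checkpoint path and the instance prompt from Sec. 4.1.</p>
        <preformat>
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical path to one DreamBooth fine-tuned Stable Diffusion checkpoint.
pipe = StableDiffusionPipeline.from_pretrained(
    "checkpoints/scabies-dark-lr2e6-steps3000",
    torch_dtype=torch.float16).to("cuda")

for seed in range(8):  # frequently changing the seed increases diversity
    gen = torch.Generator("cuda").manual_seed(seed)
    img = pipe("&lt;unique_ID&gt; human skin", height=512, width=512,
               generator=gen).images[0]
    img.save(f"synthetic/scabies_dark_{seed}.png")
        </preformat>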
        <p>To incorporate the synthetic images into the original training set (while keeping the test and
validation sets composed solely of real images) and thus provide more examples of diseases on darker skin
tones, we followed and compared three distinct numerical approaches (sketched in code after this list):
1. AugMin – synthetic images of ’dark’ and ’brown’ skin are added to ensure that the total number
of images (real + synthetic) for each disease matches the smallest image count among the other
four skin colours (’very light,’ ’light,’ ’intermediate,’ ’tan’). For example, if the ’tan’ skin colour has
the fewest examples for scabies, with N images, then an equal number of synthetic images for
’dark’ and ’brown’ skin are added to reach a total of N images for these colours as well.
2. AugBalanced – synthetic images are added for ’dark’ and ’brown’ skin tones for each disease
such that the number of images for each of these two colours represents approximately one-sixth
(approximately 17%) of the total images for that disease. This strategy aims to achieve a more
balanced distribution across all skin colours and diseases.
3. AugMax – similar to the first approach, but in this case, the total number of images (real +
synthetic) for each disease and each of the two skin colours (’brown’ and ’dark’) was adjusted
to match the largest number of images among the other four skin colours (’very light,’ ’light,’
’intermediate,’ ’tan’) for that disease.</p>
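        <p>A sketch of the per-disease target count implied by each strategy is given below. For AugBalanced, requiring each minority tone to reach one-sixth of the final total T, with T = sum(majority) + 2 · target, solves to target = sum(majority) / 4.</p>
        <preformat>
def minority_target(real_counts, strategy):
    """Final image count (real + synthetic) for each of 'dark' and 'brown'.

    real_counts: dict mapping skin tone -> number of real images of one
    disease. The number of synthetic images to generate per minority tone
    is this target minus that tone's real count.
    """
    majority = [real_counts[t] for t in
                ("very light", "light", "intermediate", "tan")]
    if strategy == "AugMin":
        return min(majority)   # match the smallest majority tone
    if strategy == "AugMax":
        return max(majority)   # match the largest majority tone
    if strategy == "AugBalanced":
        # each minority tone = T / 6 of the final total T, where
        # T = sum(majority) + 2 * target  =>  target = sum(majority) / 4
        return sum(majority) // 4
    raise ValueError(strategy)
        </preformat>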
        <p>These three distinct approaches help evaluate the impact of including varying proportions of ’dark’
and ’brown’ skin images. This enables a constructive analysis of how representing underrepresented
skin tones in the training set affects model performance and fairness outcomes. Comparing these
proportions is particularly valuable for understanding the trade-off between fairness and performance
and identifying the balance that optimizes equitable representation across skin tones with high predictive
accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Disease Classification with DL Models</title>
      <p>The dermatological classification task demands that the model captures complex features across different
scales. For this reason, we have selected two of the best state-of-the-art models, the Convolutional
Neural Network (CNN) and the Swin Transformer (ST). We start by measuring the performance of
these models without data augmentation, to serve as a baseline. Since performance results offer limited
insight into fairness concerns in the model’s diagnostic outcomes, we evaluate the models using fairness
metrics common in the literature and relevant to tasks like skin disease prediction [9], specifically
reporting the Disparate Impact Ratio (DI) [16], Equalized Odds Ratio (EOR) [17], and Predictive Rate
Ratio (PRR). Fairness metrics are often developed for binary outcome tasks, whereas the classification
task in this study is multi-class and involves more than two demographic groups. To adapt these metrics
to our setting, skin tones were aggregated into two broader categories: a minority group (“dark” and
“brown” skin tones) and a majority group (“tan,” “intermediate,” “light,” and “very light” skin tones). This
grouping is informed by two key considerations: (1) the observed underrepresentation of “dark” and
“brown” skin tones in the dataset and (2) the adoption of a similar approach by [9]. This aggregation
facilitates a meaningful application of fairness metrics while addressing the challenges posed by a
multi-class, multi-group setting. Note that this aggregation did not require re-training, as the model is
blind to the “skin tone” attribute during training. It affects only the evaluation process.</p>
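      <p>A minimal sketch of the per-disease, one-vs-rest computation of DI and EOR under this binary grouping follows; the min/max ratio convention for EOR is our assumption, and the PRR is analogous with the PPV in place of the selection rate.</p>
      <preformat>
import numpy as np

def disparate_impact(y_pred, minority):
    """DI = P(prediction = 1 | minority) / P(prediction = 1 | majority)."""
    return y_pred[minority].mean() / y_pred[~minority].mean()

def equalized_odds_ratio(y_true, y_pred, minority):
    """EOR: the smaller of the between-group TPR and FPR ratios (min/max)."""
    def rates(group):
        tpr = y_pred[np.logical_and(y_true == 1, group)].mean()
        fpr = y_pred[np.logical_and(y_true == 0, group)].mean()
        return tpr, fpr
    (tpr_a, fpr_a), (tpr_b, fpr_b) = rates(minority), rates(~minority)
    return min(min(tpr_a, tpr_b) / max(tpr_a, tpr_b),
               min(fpr_a, fpr_b) / max(fpr_a, fpr_b))

# Usage (one-vs-rest per disease): y_true = labels == d; y_pred = preds == d;
# minority = boolean mask selecting the 'dark'/'brown' group.
      </preformat>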
      <sec id="sec-6-1">
        <title>5.1. Convolutional Neural Network - No Data Augmentation</title>
        <p>To address this task, a deep CNN architecture comprising five convolutional layers, each with a kernel
size of 3, was selected. After each convolutional block, a MaxPooling layer with a kernel size of 2 is applied.
A final fully connected layer outputs nine logits corresponding to the nine target classes. The CNN
architecture was selected after a preliminary empirical evaluation, and its hyperparameters were
hand-tuned. The dataset of cropped images was partitioned into training (60% of the samples), validation
(20%), and test sets (20%). A hyperparameter tuning phase was conducted to optimize the batch size
and learning rate. Six combinations of these parameters were evaluated, with the model trained for 5
epochs using stochastic gradient descent (SGD) with momentum as the optimizer. Optimal performance
was achieved with a batch size of 128 and a learning rate of 0.01. For the final model training, the
configuration included the following settings: batch size = 128, learning rate = 0.01, number of epochs
= 15, and optimizer = SGD with momentum. Additionally, a cosine decay learning rate scheduler and
an early stopping mechanism were employed to prevent unnecessary resource usage in cases of early
overfitting.</p>
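        <p>A sketch of such an architecture in PyTorch is shown below; the channel widths are illustrative assumptions, as the text only fixes five 3×3 convolutional blocks, 2×2 max pooling, and a nine-logit head.</p>
        <preformat>
import torch.nn as nn

class SkinCNN(nn.Module):
    """Five 3x3 conv blocks with 2x2 max pooling and a nine-logit head.

    Channel widths below are hypothetical; the hand-tuned network in the
    paper has roughly 4 million parameters.
    """
    def __init__(self, n_classes=9):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 512]  # hypothetical widths
        blocks = []
        for c_in, c_out in zip(chans, chans[1:]):
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(kernel_size=2)]
        self.features = nn.Sequential(*blocks)
        # 256x256 inputs are halved five times -> an 8x8 feature map
        self.head = nn.Linear(512 * 8 * 8, n_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))
        </preformat>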
        <p>Classification results. The model was evaluated using standard performance metrics, specifically F1
score and Accuracy. Results aggregated for each disease and for each skin tone are presented in the
first column of Table 1 and Table 2, respectively.</p>
        <p>Accuracy results aggregated for skin tones indicate consistent performance across most skin tones,
except for the “very light” category, which exhibited significantly higher accuracy. This discrepancy
may be attributed to the higher overall quality and better illumination of “very light” samples, rendering
them easier to classify. The model’s performance lacks consistency when evaluated separately for each
disease. Specifically, for approximately half of the diseases, the Accuracy and F1 score on the test set
are higher for the Majority group than for the Minority group, contrary to the hypothesis that the
Minority group would be systematically disadvantaged. Two potential explanations may account for
this trend in traditional metrics: (1) factors such as poor illumination, body hair, or artefacts may lead
to misclassification of skin tone in some images, causing certain images to be incorrectly categorized
into the Minority group rather than the Majority group. This misclassification complicates the reliable
assessment of bias; (2) a lack of variability between the training set and the test set for the “black”
and “brown” skin categories (particularly for the former) may unintentionally inflate classification
performance. For example, in the case of the “black” skin tone and the disease maculopapular exanthema,
we observe a remarkably higher Accuracy and F1 score for the Minority group. Upon closer qualitative
analysis, we find that the dataset includes only one individual with “black” skin tone and this disease. As
described in Section 3, the cropping algorithm generates multiple image crops from a single individual,
distributing them across the training, validation, and test sets. During training, the model learns to
classify these crops effectively, and at test time, it encounters test crops highly similar to those seen
during training. This results in an artificially inflated Accuracy for the “black” skin tone in this specific
disease. We hypothesize that if the test set contained images of other individuals with “black” skin tone
who were not seen during training, the model’s performance would decrease significantly. In contrast,
the Majority group likely benefits from greater diversity in the training data. The model is exposed to a
wide variety of images from different individuals, enabling it to generalize better when presented with
test crops from unseen Majority group individuals, resulting in more robust performance.</p>
        <p>However, accuracy alone does not reveal the distribution of misclassifications across skin tones. As
for fairness considerations, Table 3 summarizes the results, discussed in detail in the following.</p>
        <p>The DI was calculated separately for each condition, where ŷ = 1 represents the presence of the
disease and ŷ = 0 its absence. A value between 0.8 and 1.25 is generally considered fair. Values
below 0.8 indicate unfairness against the minority group, whereas values above 1.25 suggest unfairness
against the majority group. Notably, the model demonstrates significant bias against the minority
group for diseases such as pediculosis and chickenpox, as evidenced by DI values below 0.8. This
disparity may be attributed to the limited number of positive samples from the minority group for these
conditions. In contrast, for diseases such as maculopapular rash, morbilliform rash, and scabies, there is
a proportionally higher number of positive detections in the minority group compared to the majority
group, resulting in DI values above 1.25. These observations highlight the varying degrees of fairness
across different conditions and the impact of sample imbalances in fairness evaluations. As for EOR,
in our experiments, EOR is computed for each disease, using the common division of skin tones into
minority and majority groups. The results show that of the nine EOR values, only three, corresponding
to maculopapular exanthema, viral exanthema, and urticaria, fall within the fairness range, indicating
that the model’s performance is not consistent across different demographic groups. In contrast to the
DI and EOR results, we observe that the PRR values are fair across all diseases. To understand this
difference, it is important to note that the PPV (used to compute the PRR) relative to a group measures
how often the model correctly predicts the positive class for that group. In this sense, PPV serves as a
measure of the quality of predictions. On the other hand, the DI focuses on the probability of a positive
prediction for each group, regardless of its correctness, making it a measure of quantity. Similarly,
the EOR evaluates the True Positive Rate (recall) and the False Positive Rate, which also reflect the
quantity of predictions. In short, while the PRR captures the model’s precision across groups, the
other metrics (DI and EOR) assess the distribution and balance of predictions among groups.</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Swin Transformer - No Data Augmentation</title>
        <p>As the second model for our study, we utilized a Swin Transformer [18]. During training, the same
dataset partitioning used for the CNN was adopted: 60% for training, 20% for validation, and 20% for
testing. To effectively capture the skin texture and disease-specific characteristics, a model variant
pre-trained on ImageNet-1k and later fine-tuned on a skin cancer dataset
(https://huggingface.co/gianlab/swin-tiny-patch4-window7-224-finetuned-skin-cancer) was employed. The last
three stages of the ST were fine-tuned, while the weights of the first stage were frozen, resulting in
a total of 26 million trainable parameters. To ensure convergence and fully explore the weight space,
hyperparameter tuning focused on learning rate values, specifically 1e-5, 1e-4, 1e-3, and 1e-2. While
lower learning rates ensured good convergence, they often led the model to converge to local minima,
resulting in low accuracy and F1 scores. Consequently, a learning rate of 1e-2 was selected, enabling
larger training steps. To stabilize training in the later epochs, the learning rate was reduced by a factor
of 100 after nine epochs, based on the observed loss trends.</p>
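        <p>A sketch of this staged fine-tuning, assuming the transformers implementation of the checkpoint referenced above; the momentum value is an assumption.</p>
        <preformat>
import torch
from transformers import SwinForImageClassification

model = SwinForImageClassification.from_pretrained(
    "gianlab/swin-tiny-patch4-window7-224-finetuned-skin-cancer",
    num_labels=9, ignore_mismatched_sizes=True)

# Freeze the patch embeddings and the first of the four Swin stages;
# the last three stages remain trainable (roughly 26M parameters).
for module in (model.swin.embeddings, model.swin.encoder.layers[0]):
    for p in module.parameters():
        p.requires_grad = False

optim = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                        lr=1e-2, momentum=0.9)
# Cut the learning rate by a factor of 100 after nine epochs.
sched = torch.optim.lr_scheduler.MultiStepLR(optim, milestones=[9], gamma=0.01)
        </preformat>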
        <p>Classification results. The final performance results are shown in the first column of Tables 4, 5,
and 6, aggregated by disease and skin tones respectively.</p>
        <p>First, we notice a significant performance improvement compared to the CNN, likely due to the
remarkably higher capacity of the ST (∼26 million trainable parameters versus the ∼4 million of the
CNN). Moreover, it is evident that, while for the CNN the Accuracy and F1 scores vary significantly
between demographic groups depending on the disease, the ST shows more consistent Accuracy and
F1-score values. The DI values of the ST indicate that, on average, it demonstrates greater fairness
compared to the CNN. Specifically, the Swin Transformer exhibits only three out of nine instances
of unfair values for the DI metric (Table 5), in contrast to the CNN, which presents five out of nine
instances of unfair values for the same metric. On the other hand, EOR values are systematically
lower in the Swin Transformer results compared to those of the CNN: of the nine EOR values, only
one—specifically the one corresponding to viral exanthema—falls within the fairness range, once again
indicating that the model’s predictions are strongly influenced by the skin tone attribute. As for the PRR
values, they remain within acceptable limits, indicating that the model’s precision is comparable for
both the Minority and Majority categories. However, as already stated at the end of Sect. 5.1, this
does not imply a fair prediction across the various skin tones, as evidenced by the values of the other
metrics.</p>
      </sec>
      <sec id="sec-6-3">
        <title>5.3. Convolutional Neural Network - With Data Augmentation</title>
        <p>The addition of synthetic images generally resulted in a significant performance improvement across all
diseases (including those for which no synthetic data was added), as evidenced by the values of Table 1.
This effect may be attributed to the regularizing impact
of these new data on the dataset, which benefited all diseases. Furthermore, Accuracy and F1-score also
improved across individual skin tones, including both darker and lighter tones, for which no synthetic
images were generated. Overall, except for the ‘very light’ skin tone, the addition of synthetic data
helped equalize performance across skin tones, raising metrics for lighter tones (which previously had
lower Accuracy and F1 scores compared to ‘dark’ and ‘brown’ tones) more than it did for darker tones.
Regarding fairness metrics, synthetic data also benefited diseases for which no synthetic images were
generated. We now provide a more detailed analysis of each augmentation technique.</p>
        <p>AugMin adds the fewest synthetic images and provides the smallest improvement in terms of both
traditional and fairness metrics, suggesting that more synthetic data could be beneficial. As shown
in Table 1, the Accuracy and F1-score improvements for individual diseases sometimes narrowed the
performance gap between Minority and Majority groups (e.g., in the cases of drug-induced iatrogenic
exanthema, morbilliform exanthema, polymorphous exanthema, viral exanthema, urticaria, and scabies),
while in other cases, the performance gap widened. Regarding fairness metrics (Table 3), no significant
improvement was observed for the three diseases targeted with synthetic images in ‘dark’ and ‘brown’
tones, and in some cases, a decline was noted. Interestingly, however, certain diseases for which no
synthetic images were generated showed counterintuitive fairness improvements. In summary, this
approach appears to function as a regularizer that enhances overall performance and improves the
homogeneity of model performance across skin tones. However, it is not effective in improving classification
fairness, particularly for the targeted diseases (i.e., maculopapular exanthema, viral exanthema, and
scabies). AugBalanced outperformed the previous approach in terms of both Accuracy and F1-score.
In this case, fairness outcomes for the three targeted diseases were very similar to those observed
without synthetic data. However, for most other diseases, fairness appeared to improve, particularly
for EOR values. Overall, this demonstrates that a greater presence of synthetic data has a stronger
regularizing effect on performance, benefiting nearly all diseases and all skin tones. AugMax yielded
the best trade-off between fairness and performance. Accuracy and F1-score metrics remained higher
than the model trained on the original dataset, while the average fairness metrics for each disease fell
within ranges considered fair. Significant improvements were observed for both DI and EOR values in
two of the three diseases for which synthetic images were generated (i.e., viral exanthema and scabies).
Additionally, for most other diseases, fairness metrics also improved. For the CNN, generating synthetic
images for the three targeted diseases and incorporating them into the dataset proved beneficial for
both overall model performance and classification fairness. The more synthetic images, the merrier:
AugMax achieves the best trade-off between fairness and performance.</p>
      </sec>
      <sec id="sec-6-4">
        <title>5.4. Swin Transformer - With Data Augmentation</title>
        <p>The ST model had already demonstrated very high accuracy and F1-score values across all diseases
and skin tones, although EOR values were notably problematic, especially for the latter diseases. With
augmentation, AugMin improved accuracy and F1-scores for several diseases in minority categories
(i.e., ‘dark’ and ‘brown’ skin tones), although at a slight cost to the majority category. Overall, the
classification performance remained comparable to the model trained on the original dataset across all
skin tones and diseases. In terms of fairness, AugMin led to significant improvements, particularly
in EOR values; moreover, DI values improved for two of the three target diseases, and PRR values
also showed improvement. Compared to the other approaches, AugBalanced reduces accuracy and
F1-score values. However, it shows notable improvements in EOR values, with six out of nine improving,
although at the expense of two deteriorating compared to the model trained on the original dataset. DI
values worsened, while PRR values improved. On the other hand, AugMax maintains good accuracy
and F1-score values, though distributed differently compared to AugMin. In terms of fairness metrics,
this approach performs well for DI and PRR values, though it does not substantially improve over the
model trained on the original dataset, except for pediculosis. However, it is less effective for EOR values,
except for urticaria and scabies, where it achieves fairness range values. In conclusion, for the ST model,
AugMin proved to be the most effective.</p>
        <p>This outcome can be attributed to at least two factors: (1) Although the ST model has significantly
more trainable parameters than the CNN, it is pre-trained, which makes it inherently more resistant to
substantial changes. In contrast, the CNN is trained entirely from scratch, providing greater flexibility
for performance improvements. This explains why the impact of synthetic images is more pronounced
with the CNN; (2) the ST had already achieved high accuracy and F1-score during initial training without
synthetic data, showing fewer signs of overfitting compared to the CNN. Therefore, synthetic images
did not produce the same regularizing effect on the ST as on the CNN, where there was more room
for improvement. This also clarifies why the ST benefited more from an approach involving less
synthetic data: adding more data likely pushed the model beyond its ’saturation point,’ thus limiting
the desired improvements in fairness, although it still managed to deliver better fairness results than
the same model trained on the original dataset.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>This study demonstrated the effectiveness of using advanced image generation techniques, like
DreamBooth combined with stable diffusion, to enhance the representation of underrepresented skin tones
in medical datasets. Our methods significantly improved fairness metrics, balancing performance
and fairness effectively. Incorporating synthetic images, especially in the training sets for diseases
affecting ’dark’ and ’brown’ skin tones, addressed data scarcity issues and reduced bias in medical
image analysis. The comparison of different data augmentation strategies (AugMin, AugBalanced,
AugMax) helped us understand the trade-offs between dataset diversity and predictive accuracy. While
the CNN showed more significant improvements due to its flexibility, the pre-trained nature of the
ST limited its adaptability to synthetic data enhancements. However, both models benefited from our
approach, underscoring the potential of synthetic data to improve diagnostic tools across diverse skin
tones. We also noticed that although synthetic images are produced only for specific diseases, the
experimental results demonstrate enhanced performance across all nine diseases catalogued in our
study. Our findings advocate for the continued use of synthetic data augmentation to enhance fairness
and performance in dermatological AI applications, paving the way for more equitable healthcare
solutions.</p>
      <p>A potential extension of this work could involve generating images of diseases on dark skin by
adapting images of diseases from lighter skin tones. This approach would create more examples of
diseases on dark skin, which are currently underrepresented. However, this method must be approached
with caution, as the texture and appearance of dermatological conditions can vary significantly between
different skin tones, potentially affecting the scientific accuracy of the generated images. This careful
consideration is essential to ensure the development of AI-based diagnostic tools that are both effective
and equitable across diverse populations.</p>
      <p>[15] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, K. Aberman, DreamBooth: Fine tuning text-to-image
diffusion models for subject-driven generation, 2023. URL: https://arxiv.org/abs/2208.12242.
arXiv:2208.12242.
[16] M. Feldman, S. Friedler, J. Moeller, C. Scheidegger, S. Venkatasubramanian, Certifying and removing
disparate impact, 2015. URL: https://arxiv.org/abs/1412.3756. arXiv:1412.3756.
[17] A. Agarwal, A. Beygelzimer, et al., A reductions approach to fair classification, in: J. Dy, A. Krause
(Eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of
Proceedings of Machine Learning Research, PMLR, 2018, pp. 60–69. URL:
https://proceedings.mlr.press/v80/agarwal18a.html.
[18] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin transformer: Hierarchical
vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference
on Computer Vision (ICCV), 2021, pp. 10012–10022.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>C.-H. Chiu</surname>
            ,
            <given-names>Y.-J.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Shi</surname>
          </string-name>
          , T.-Y. Ho,
          <article-title>Achieve fairness without demographics for dermatological disease diagnosis</article-title>
          ,
          <source>Medical Image Analysis</source>
          <volume>95</volume>
          (
          <year>2024</year>
          )
          103188. URL: https://www.sciencedirect.com/science/article/pii/S1361841524001130. doi:10.1016/j.media.2024.103188.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Aayushman</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Gaddey</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Mittal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chawla</surname>
            ,
            <given-names>G. R.</given-names>
          </string-name>
          <string-name>
            <surname>Gupta</surname>
          </string-name>
          ,
          <article-title>Fair and accurate skin disease image classification by alignment with clinical labels</article-title>
          , in: M. G. Linguraru,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Feragen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Giannarou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Glocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lekadir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Schnabel</surname>
          </string-name>
          (Eds.),
          <source>Medical Image Computing and Computer Assisted Intervention - MICCAI 2024</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>394</fpage>
          -
          <lpage>404</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hadzic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Paul</surname>
          </string-name>
          , D. V. de Flores,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mathew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aucott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Burlina</surname>
          </string-name>
          ,
          <article-title>Edgemixup: improving fairness for skin disease classification and segmentation</article-title>
          ,
          <source>arXiv preprint arXiv:2202.13883</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          , S. Liu, T. Chen, Fairskin:
          <article-title>Fair difusion for skin disease image generation</article-title>
          ,
          <source>arXiv preprint arXiv:2410.22551</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Shi,</surname>
          </string-name>
          <article-title>FairQuantize: Achieving Fairness Through Weight Quantization for Dermatological Disease Diagnosis</article-title>
          ,
          <source>in: proceedings of Medical Image Computing and Computer Assisted Intervention - MICCAI 2024, volume LNCS 15010</source>
          , Springer Nature Switzerland,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Chiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <article-title>Achieving fairness through channel pruning for dermatological disease diagnosis</article-title>
          , in:
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Linguraru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Feragen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Giannarou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Glocker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lekadir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Schnabel</surname>
          </string-name>
          (Eds.),
          <source>Medical Image Computing and Computer Assisted Intervention - MICCAI 2024</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>24</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Addressing fairness issues in deep learning-based medical image analysis: a systematic review</article-title>
          ,
          <source>npj Digital Medicine</source>
          <volume>7</volume>
          (
          <year>2024</year>
          )
          <fpage>286</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E. R.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Trager</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Geskin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Dugdale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. H.</given-names>
            <surname>Samie</surname>
          </string-name>
          ,
          <article-title>Ethical considerations for artificial intelligence in dermatology: a scoping review</article-title>
          ,
          <source>British Journal of Dermatology</source>
          (
          <year>2024</year>
          )
          ljae040
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Corbin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Marques</surname>
          </string-name>
          ,
          <article-title>Assessing bias in skin lesion classifiers with contemporary deep learning and post-hoc explainability techniques</article-title>
          ,
          <source>IEEE Access</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>78339</fpage>
          -
          <lpage>78352</lpage>
          .
          doi:10.1109/ACCESS.2023.3289320.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chardon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Cretois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hourseau</surname>
          </string-name>
          ,
          <article-title>Skin colour typology and suntanning pathways</article-title>
          ,
          <source>International Journal of Cosmetic Science</source>
          <volume>13</volume>
          (
          <year>1991</year>
          ). URL: https://api.semanticscholar.org/CorpusID:2565093.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <article-title>Skin typing: Fitzpatrick grading and others</article-title>
          ,
          <source>Clinics in Dermatology</source>
          <volume>37</volume>
          (
          <year>2019</year>
          )
          <fpage>430</fpage>
          -
          <lpage>436</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0738081X1930121X. doi:10.1016/j.clindermatol.2019.07.010, The Color of Skin.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Groh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Soenksen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koochek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Badri</surname>
          </string-name>
          ,
          <article-title>Evaluating deep neural networks trained on clinical images in dermatology with the fitzpatrick 17k dataset</article-title>
          , in:
          <source>2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</source>
          , IEEE Computer Society, Los Alamitos, CA, USA,
          <year>2021</year>
          , pp.
          <fpage>1820</fpage>
          -
          <lpage>1828</lpage>
          . URL: https://doi.ieeecomputersociety.org/10.1109/CVPRW53098.2021.00201. doi:10.1109/CVPRW53098.2021.00201.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Kinyanjui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Odonga</surname>
          </string-name>
          , et al.,
          <article-title>Fairness of classifiers across skin tones in dermatology</article-title>
          , in:
          <source>Medical Image Computing and Computer Assisted Intervention - MICCAI 2020</source>
          , Springer International Publishing, Cham,
          <year>2020</year>
          , pp.
          <fpage>320</fpage>
          -
          <lpage>329</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Charlton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Stanley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Whitman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Wenn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Coats</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sims</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <article-title>The effect of constitutive pigmentation on the measured emissivity of human skin</article-title>
          ,
          <source>PLOS ONE</source>
          <volume>15</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . URL: https://doi.org/10.1371/journal.pone.0241843. doi:10.1371/journal.pone.0241843.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jampani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pritch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rubinstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Aberman</surname>
          </string-name>
          ,
          <article-title>Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation</article-title>
          , in:
          <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>