<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
<article-title>Integrating convolutional neural networks and autoencoders for skin lesion diagnosis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kateryna Bilyk</string-name>
          <email>kkaterynabilyk@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Olga Narushynska</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petro Liashchynskyi</string-name>
          <email>petro.b.liashchynskyi@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasyl Teslyuk</string-name>
          <email>vasyl.m.teslyuk@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>12 Bandera Street, Lviv, 79013</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MoMLeT-2025: 7th International Workshop on Modern Machine Learning Technologies</institution>
          ,
          <addr-line>June, 14, 2025, Lviv-Shatsk</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This study explores the effectiveness of deep learning models for the automated detection and classification of skin lesions in dermoscopic images. A combination of models was tested, starting with an autoencoder for preliminary detection of mole presence. This step helps filter out irrelevant images before further processing. The subsequent stage involves comparing several models for lesion classification, with a custom ResNet-50-based classifier achieving the highest performance, with a validation F1-score of 0.886, confirming its suitability for diagnostic tasks. For segmentation, a Mask R-CNN model was employed, achieving an Intersection over Union (IoU) of 87%. This model accurately detects and segments all visible moles, regardless of their size or location, enabling the classification of each individual lesion - addressing a key limitation of traditional methods. The models were trained and evaluated using a combination of publicly available datasets and synthesized images with artificially added lesions, enhancing dataset variability and realism. The findings indicate that combining the ResNet-50 classifier with the Mask R-CNN segmentation model constitutes a robust pipeline for integration into clinical decision-support systems, providing valuable assistance for healthcare professionals in skin lesion analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>deep learning</kwd>
        <kwd>skin cancer</kwd>
        <kwd>melanoma</kwd>
        <kwd>classification</kwd>
        <kwd>segmentation</kwd>
<kwd>autoencoders</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The object of this study is the process of analyzing skin lesion photographs using deep learning
methods, while the subject is the algorithms, models, and tools for their effective monitoring and
classification.</p>
      <p>To achieve this goal, the following key objectives have been defined:</p>
      <list list-type="bullet">
        <list-item><p>To conduct an analysis of current research and scientific publications in the field of
automated skin lesion diagnostics, particularly in the context of segmentation and
classification methods.</p></list-item>
        <list-item><p>To investigate the effectiveness of various image processing algorithms for detecting and
analyzing skin lesions.</p></list-item>
        <list-item><p>To justify, select, and develop neural network models for building an automated skin image
analysis system, including data preprocessing steps and result evaluation methods.</p></list-item>
      </list>
      <p>This study is aimed at creating an effective tool for medical professionals in the process of
diagnosing skin cancer. The proposed methods and models have the potential to improve the
accuracy of diagnostic decisions and support early detection of dangerous pathologies, which may
significantly impact patient survival rates.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review</title>
      <p>
        In the article “Skin Cancer Detection Using Deep Learning – A Review” by Naqvi et al. (2023) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
five key networks are identified as the foundation for modern skin cancer detection systems: AlexNet
(Krizhevsky et al., 2012) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], VGG (Simonyan &amp; Zisserman, 2015) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], ResNet (He et al., 2016) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
DenseNet (Huang et al., 2017) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], and MobileNet (Howard et al., 2017) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. However, it is worth
noting that these architectures are relatively outdated, as they were proposed between 2012 and
2017. Nowadays, an increasing number of modern models are emerging, particularly
transformer-based and hybrid approaches, designed for enhanced performance in medical computer vision tasks.
Some of them, especially commercial solutions, are not accessible for academic research,
complicating their implementation. Therefore, adapting state-of-the-art architectures to the task of
skin lesion detection remains an important area of scientific investigation.
      </p>
      <p>One innovative direction involves the use of Spiking Neural Networks (SNNs) for mole image
classification. These networks mimic the way information is transmitted in the human brain, where
data is passed through short pulses rather than continuous signals, as in traditional networks [20].
A neuron in an SNN "decides" to send a spike when the accumulated information reaches a certain
threshold. SNNs aim to merge neurobiology and machine learning by employing biologically realistic
neuron models for computation [20].</p>
      <p>
        Gilani et al. [21] demonstrated that SNNs, trained with fewer parameters, can outperform
traditional CNNs in terms of F1-score and overall accuracy, while consuming significantly less
energy. However, specificity and precision remain lower compared to VGG-13, and the hardware
implementation of SNNs requires additional modules to process spiking events [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Abdar et al. [22] proposed a hybrid neural network model with uncertainty quantification, which
is crucial for understanding the reliability of deep neural network predictions. Traditional neural
networks do not provide information about the confidence of their decisions, which can be critical
for medical applications. To address this issue, various uncertainty estimation methods have been
developed, such as Monte Carlo Dropout (randomly deactivating neurons to generate different
predictions for the same input), Ensemble MC Dropout (creating several models with different
parameters, each generating predictions for the same input combined with Monte Carlo Dropout),
and Deep Ensemble (training multiple independently initialized models on different data subsets)
[22]. The proposed method achieved 88.95% accuracy and an F1-score of 0.909 on the ISIC 2019
dataset, indicating high potential [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>Lu and Firoozeh Abolhasani Zadeh [23] proposed a modified version of XceptionNet using the
Swish activation function (which combines properties of the linear function and ReLU). Xception,
first introduced by François Chollet (2017) [24], is based on depthwise separable convolutions – a
two-step convolutional operation that first performs spatial filtering on each channel individually
and then combines the extracted features using a 1×1 convolution. This design significantly reduces
the number of parameters and computational cost while maintaining a high capacity for capturing
complex spatial features.</p>
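        <p>For illustration, a minimal PyTorch sketch of a depthwise separable convolution followed by the
Swish activation; this is a sketch only, not the authors' or Chollet's exact implementation, and the
layer sizes are assumptions:</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Xception-style block: per-channel spatial filtering (depthwise),
    then a 1x1 pointwise convolution that recombines the channels."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        # groups=in_ch applies one spatial filter per input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # 1x1 convolution mixes the per-channel features
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def swish(x):
    # Swish: x * sigmoid(x), equivalent to torch.nn.SiLU
    return x * torch.sigmoid(x)

block = DepthwiseSeparableConv(3, 64)
y = swish(block(torch.randn(1, 3, 224, 224)))  # -> (1, 64, 224, 224)
]]></preformat>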
      <p>Thanks to the integration of the Swish function, the improved XceptionNet trained on HAM10000
data demonstrated excellent performance: classification accuracy reached 100%, and the F1-score was
0.953. Additionally, there was a notable improvement in metrics compared to other convolutional
networks, confirming the effectiveness of the proposed approach for mole detection tasks.</p>
      <p>
        Khan et al. [25] presented a fully automated CNN-based approach that combines preprocessing,
segmentation, and classification stages for skin lesion analysis. In the preprocessing phase, the Local
Color-Controlled Histogram Intensity Values (LCcHIV) method was used to enhance contrast and
normalize local skin region lighting. The enhanced images were then used as input for the
segmentation network. Segmentation was performed using a novel Deep Saliency Segmentation
method, which generates a heatmap and then converts it into a binary mask via a thresholding
function. After that, deep features were extracted using pre-trained ResNet101 and DenseNet201
models, and classification was performed using a Kernel Extreme Learning Machine. While the
model demonstrated high classification accuracy on the HAM10000 dataset, its segmentation
effectiveness was evaluated on a small set of 200 images only, indicating the need for further
validation on larger datasets [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>All of the above-mentioned methods show high performance and significant potential for
application in skin lesion detection, particularly due to innovative approaches such as spiking neural
networks, hybrid models with uncertainty estimation, or enhanced architectures like XceptionNet
with Swish activation. However, implementing such solutions requires a deep understanding of the
corresponding methods as well as access to open-source code or specific hardware. Given this, the
present study focuses on verified and more accessible architectures, emphasizing their adaptation
and practical application to the task of mole detection. This approach helps maintain a balance
between implementation complexity and the quality of the results obtained.</p>
      <sec id="sec-2-1">
        <title>2.1. Validation model</title>
        <p>
          In the task of mole detection in images, it is essential to synthesize a model capable of correctly
and reliably identifying their presence despite high variability in skin appearance and the presence
of noise. One-Class Classification (OCC) proves to be appropriate in this context, as it enables the
model to focus on learning only the positive class and detecting deviations from it [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          One of the most effective tools for implementing OCC is the autoencoder – a type of unsupervised
neural network capable of generating a compressed representation of input data [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Literature [
          <xref ref-type="bibr" rid="ref3 ref7">3,7</xref>
          ]
highlights that autoencoders are highly effective for anomaly detection tasks due to their ability to
reconstruct only those data that share common features with the training set. A significant
discrepancy between the original and reconstructed image indicates atypical input data, allowing the
detection of anomalies – such as the absence of a mole or its unusual appearance.
        </p>
        <p>
          In particular, in work [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], an autoencoder with fully connected layers was implemented and tested
on the MNIST dataset. Despite the small image size, the model showed high effectiveness (AUC =
0.960 ± 0.002), indicating the potential of autoencoders in one-class classification tasks. However, for
medical images characterized by more complex structures, it is advisable to apply convolutional
architectures and test them on real medical data with higher resolution.
        </p>
        <p>Thus, autoencoders are a justified choice for implementing OCC in the task of mole detection, as
they allow for effective modeling of the "normal" skin structure and detecting deviations from it.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Segmentation model</title>
        <p>
          One of the key challenges when using neural networks for skin cancer diagnosis is the presence
of artifacts in dermatoscopic images, such as hair, shadows, marker lines, bubbles, or rulers, which
can reduce classification quality [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. To improve results, segmentation is often used as a
preprocessing step, enabling the separation of relevant objects (e.g., moles) from the background and
extraneous elements. Given that several moles may be present in a single image, it is appropriate to
apply instance segmentation, which allows identifying each individual object within the same
category [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          One of the most common architectures for segmentation is Mask R-CNN – a model that combines
localization, classification, and pixel-level mask generation for each detected object [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The model
consists of several key components: a feature extractor (e.g., ResNet), a RPN (Region Proposal
Network), a RoI Align (Region of Interest Align) module for precise region alignment, and separate
branches for classification, bounding box regression, and binary mask generation (see Figure 1). This
combination ensures high segmentation accuracy even in cases of complex mole morphology.
        </p>
        <p>Modern versions of the YOLO model, particularly YOLOv8, combine high processing speed with
accurate object segmentation [30]. YOLOv8 is based on an improved CNN architecture that
effectively extracts image features while maintaining a balance between speed and accuracy. It
supports instance segmentation by generating precise masks for each detected object, making it
suitable for real-time applications. To train the model on a custom dataset, the COCO format must
be used, which includes object coordinates, polygon contours, and class information, providing
flexibility in representing complex annotations.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Classification model</title>
        <p>In the context of automated image classification tasks for detecting malignant skin lesions, the
use of deep convolutional neural networks (CNNs) is highly relevant. The complexity of this task is
due to the high visual similarity between different types of pathologies, the variability in mole
appearance, the presence of artifacts, and inconsistent lighting. Therefore, numerous studies focus
on comparing different architectures to determine the most effective approach.</p>
        <p>
          One of the first breakthrough architectures in the field of computer vision was AlexNet [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. It
introduced the use of deeper networks, ReLU activation functions, Dropout, and efficient max
pooling techniques. This approach not only achieved high classification accuracy on ILSVRC-2012
but also initiated the deep learning era in medical image analysis.
        </p>
        <p>
          This was further developed with the appearance of ResNet [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which introduced residual
connections (skip connections) – an effective solution to the vanishing gradient problem when
training deep models. Using bottleneck blocks and Batch Normalization, the model enables stable
training even with considerable network depth, which is especially important for analyzing complex
dermatological images.
        </p>
        <p>
          The issue of limited computational resources prompted the development of more compact models,
such as SqueezeNet [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which achieves results comparable to AlexNet while having 50 times fewer
parameters. Thanks to its unique Fire modules combining 1×1 and 3×3 convolutions, the model
efficiently extracts features while maintaining low complexity, making it suitable for mobile
applications.
        </p>
        <p>
          Another direction of optimization involves multi-scale feature processing. In this context, the
Inception network model [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] stands out due to the use of modules that simultaneously analyze
information using convolutions of different sizes (1×1, 3×3, 5×5) and pooling. The architecture is
optimized by factorizing large filters and using auxiliary classifiers, improving the training quality
of deeper layers.
        </p>
        <p>
          Another modern approach is implemented in EfficientNet [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], which proposes scaling the model
in three dimensions simultaneously: depth, width, and image resolution (compound scaling). This
strategy allows achieving a better balance between performance and accuracy. Depending on
available computational resources, one can choose from multiple variants (from B0 to B7), which
provides additional flexibility.
        </p>
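        <p>For reference, the compound scaling rule from the original paper [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] ties depth d, width w, and resolution r to a single coefficient φ:</p>
        <disp-formula><tex-math><![CDATA[d = \alpha^{\varphi}, \quad w = \beta^{\varphi}, \quad r = \gamma^{\varphi}, \qquad \text{s.t.}\ \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,\ \alpha, \beta, \gamma \geq 1,]]></tex-math></disp-formula>
        <p>where α, β, and γ are constants found by a small grid search, and φ controls how much additional
compute is spent on scaling.</p>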
        <p>
          Among the classic models, it is also worth mentioning the VGG architecture, which, thanks to its
simple but deep structure (sequential 3×3 convolutions with ReLU and max pooling), demonstrates
stable accuracy across many tasks. Its main drawbacks remain the large number of parameters (~138
million) and high computational load, which limit its applicability in real-world medical settings [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
        <p>Thus, the literature presents a wide range of architectures, each with its advantages depending
on the requirements for accuracy, speed, and available resources. Comparing these models in the
context of skin cancer diagnosis enables a well-grounded selection of the most relevant solution.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Materials and methods</title>
      <sec id="sec-3-1">
        <title>3.1. Input data and sources</title>
        <p>
          For this study, we used the publicly available HAM10000 dataset (Human Against Machine with
10,000 training images) [
          <xref ref-type="bibr" rid="ref2 ref5">2, 5</xref>
          ]. This dataset contains 10,000 images of moles, most likely captured
using a dermatoscope. All images have a resolution of 600×450 pixels and are stored in three-channel
RGB format.
        </p>
        <p>The dataset includes images from the following seven classes:
</p>
        <p>
          Actinic keratoses and Bowen’s disease are non-invasive skin lesions that can progress into
squamous cell carcinoma and are often caused by UV exposure (AKIEC – Actinic Keratoses
/ Bowen’s Disease, see Figure 2) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          Basal cell carcinoma is a common form of skin cancer that rarely metastasizes but can grow
destructively if left untreated (BCC – Basal Cell Carcinoma, see Figure 3) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          Benign keratosis includes seborrheic keratoses, lentigines, and lichen planus-like keratoses –
pigmented lesions that may mimic melanoma (BKL – Benign Keratosis, see Figure 4) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          Dermatofibroma is a benign skin lesion that often features a central white area and may
develop due to minor injuries (DF – Dermatofibroma, see Figure 5) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].


        </p>
        <p>
          Melanocytic nevi are benign tumors of melanocytes that typically exhibit symmetric
structures and homogeneous coloring (NV – Melanocytic Nevi, see Figure 6) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          Melanoma is a malignant tumor that can be effectively treated through surgical excision if
diagnosed early (MEL – Melanoma, see Figure 7) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          Vascular lesions, such as angiomas or hematomas, usually appear as red or purple spots with
well-defined borders (VASC – Vascular Lesions, see Figure 8) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>One of the key advantages of using the HAM10000 dataset is the availability of segmentation
masks. This enables not only image classification but also the improvement of segmentation
algorithms.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Balancing and preprocessing of input data</title>
        <p>
          Due to a significant class imbalance in the input dataset for classification (e.g., the NV class
contains 6705 images, while the DF class includes only 115), a balancing strategy was applied [26].
The oversampling method SMOTE [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] was used to increase the number of samples for minority
classes. Additionally, to ensure an even distribution of images across data subsets, stratified splitting
was employed.
        </p>
        <p>The balancing was performed using the following strategy (a code sketch follows the list):</p>
        <list list-type="bullet">
          <list-item><p>A subset was selected for each class, not exceeding the maximum number of images per class.</p></list-item>
          <list-item><p>Stratified splitting ensured that all subsets contained proportionally the same number
of images from each class.</p></list-item>
          <list-item><p>The final data split consisted of a training set (~83%), validation set (~10%), and test
set (~7%).</p></list-item>
        </list>
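        <p>A minimal sketch of this strategy, assuming SMOTE from the imbalanced-learn package is applied to
flattened image features; the feature dimensionality and random seeds below are illustrative:</p>
        <preformat><![CDATA[
import numpy as np
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE  # imbalanced-learn package

# Illustrative stand-ins: X holds flattened image features, y the class labels.
X = np.random.rand(1000, 768)
y = np.random.randint(0, 7, size=1000)

# SMOTE interpolates between minority-class neighbours in feature space,
# which is why images are flattened (or embedded) before resampling.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Stratified splits keep class proportions equal across subsets:
# roughly 83% train, 10% validation, 7% test, as described above.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X_res, y_res, test_size=0.17, stratify=y_res, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.41, stratify=y_tmp, random_state=42)
]]></preformat>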
        <p>
          To increase the model’s robustness to variations in lighting, scale, and image positioning,
moderate data augmentation was applied. All images were subjected to light blurring with a
probability of 30%, minor shifts, rotations up to 5°, and scaling. Additionally, brightness and contrast
adjustments were made, and pixel values were normalized to the [0, 1] range. This approach helped
improve the model's generalization ability while preserving key features (see Figure 9).
        </p>
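        <p>A sketch of such a pipeline with torchvision transforms: the 5° rotation limit and 30% blur
probability follow the text, while the shift, scale, brightness, and contrast magnitudes are
illustrative assumptions:</p>
        <preformat><![CDATA[
import torchvision.transforms as T

train_transforms = T.Compose([
    T.RandomApply([T.GaussianBlur(kernel_size=3)], p=0.3),  # light blur, p=0.3
    T.RandomAffine(degrees=5, translate=(0.05, 0.05), scale=(0.9, 1.1)),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),  # converts PIL images to float tensors in [0, 1]
])
]]></preformat>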
        <p>The segmentation model was trained on data containing images with multiple moles and their
corresponding segmentation masks (see Figure 10). This approach allows for proper processing of
scenes with multiple lesions. To expand the dataset, 1000 synthetic images were generated by
randomly placing 2–4 mole fragments from the original HAM10000 dataset onto an artificial
background. Object scaling was applied, and overlaps were avoided. A corresponding segmentation
mask was automatically created for each image.</p>
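        <p>A sketch of this compositing step, assuming PIL RGB crops with grayscale ("L") masks; the
scaling range is an assumption, and the overlap avoidance described above is omitted for brevity:</p>
        <preformat><![CDATA[
import numpy as np
from PIL import Image

def compose_synthetic(background, crops, masks, rng=np.random):
    """Paste mole crops onto a background image and build the matching
    per-instance binary masks."""
    canvas = background.copy()
    instance_masks = []
    for crop, mask in zip(crops, masks):
        scale = rng.uniform(0.7, 1.3)  # random object scaling (assumed range)
        size = (max(1, int(crop.width * scale)),
                max(1, int(crop.height * scale)))
        crop, mask = crop.resize(size), mask.resize(size)
        # random placement; crops are assumed smaller than the background
        x = rng.randint(0, max(1, canvas.width - size[0]))
        y = rng.randint(0, max(1, canvas.height - size[1]))
        canvas.paste(crop, (x, y), mask)  # grayscale mask acts as alpha
        full = Image.new("L", canvas.size, 0)
        full.paste(mask, (x, y))
        instance_masks.append(np.array(full) > 0)
    return canvas, instance_masks
]]></preformat>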
        <p>Additionally, a validation neural network model based on an autoencoder was developed to detect
the presence of a mole in the image. The same augmentation strategy used for the classification
model was applied here, ensuring data consistency and improving the image processing
effectiveness.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Developed models and their characteristics</title>
        <p>As part of this study, neural network models were developed based on existing architectures, each
with specific structural features and advantages for image classification tasks. For example,
ResNet50, due to its residual connections, effectively minimizes the vanishing gradient problem and
achieves high accuracy. EfficientNet provides a balance between accuracy and the number of
parameters through optimal scaling of depth, width, and resolution. VGG, despite its simple
architecture, performs well in image processing tasks. For comparison, AlexNet was used – one of
the first successful CNN architectures that laid the foundation for deep learning development.
SqueezeNet, due to its compactness, processes images quickly without loss of accuracy. Inception,
by using filters of various sizes, efficiently extracts features, which is particularly useful for analyzing
complex structures of skin lesions.</p>
        <p>Image validation was performed using an autoencoder, specially designed for this task. As a
one-class classifier, it was trained on mole images, enabling it to effectively reconstruct known patterns
and detect anomalous deviations related to the presence of skin lesions. The schema of the developed
autoencoder is shown in Figure 11.</p>
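        <p>For concreteness, a minimal convolutional autoencoder sketch in PyTorch; the actual architecture
of Figure 11 is not fully specified in the text, so the layer sizes and the 64×64 input below are
illustrative, with latent_dim matching the hyperparameter varied in Table 1:</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Illustrative convolutional autoencoder; assumes 64x64 RGB input."""
    def __init__(self, latent_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),  # compressed representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
]]></preformat>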
        <p>For segmentation, a neural network model based on Mask R-CNN was developed, which enabled
not only the identification of moles in images but also the precise delineation of their boundaries.
This is critically important for further analysis and diagnosis.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Neural network model optimization</title>
        <p>To ensure stable training of the neural network and reduce the risk of overfitting, a number of
well-established techniques were applied. In particular, the use of the Adam (Adaptive Moment
Estimation) optimizer enabled effective adaptation to the specifics of the data and contributed to fast
and stable convergence during training. Adam is one of the most popular optimizers due to its ability
to automatically adjust the learning rate for each parameter individually, thereby improving training
efficiency.</p>
        <p>To overcome overfitting, the early stopping technique was used. This involves halting the training
process when the model's performance on the validation set stops improving over a certain period.
This helps avoid overfitting and ensures better generalization to new, unseen data.</p>
        <p>In addition, a dynamic learning rate reduction approach was implemented when a “plateau” was
reached – i.e., when model performance metrics stopped improving. This helped further optimize
training and avoid stagnation.</p>
        <p>The use of these techniques collectively significantly improved the stability of the training
process, optimized the model parameters, and enhanced its generalization capability on new data.</p>
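        <p>A compact sketch of this recipe (Adam, early stopping, and learning-rate reduction on plateau) in
PyTorch; the model, the validation helper, and the patience values below are illustrative stand-ins:</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

model = nn.Linear(10, 7)  # placeholder network for illustration

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3)

def evaluate(net):
    # stand-in for a real validation pass; returns a scalar loss
    x, y = torch.randn(32, 10), torch.randint(0, 7, (32,))
    with torch.no_grad():
        return nn.functional.cross_entropy(net(x), y).item()

best_val, patience, bad_epochs = float("inf"), 7, 0
for epoch in range(100):
    # ... one training epoch over the train loader would run here ...
    val_loss = evaluate(model)
    scheduler.step(val_loss)        # reduce LR when the metric plateaus
    if val_loss < best_val - 1e-4:  # meaningful improvement
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # early stopping
            break
]]></preformat>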
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Software tools used</title>
        <p>The models were implemented using the Python programming language and the PyTorch [27],
scikit-learn [28] and NumPy [29] libraries. Python was used for general data processing and
manipulation, PyTorch for efficient neural network construction and training, and scikit-learn for
performance evaluation. NumPy was utilized for efficient mathematical operations and handling
large data arrays.</p>
        <p>The models were trained on a GPU (Graphics Processing Unit) in the Kaggle environment, which
significantly accelerated the training process and enabled the handling of large data volumes. The
use of such an environment ensured stable and fast data processing due to access to powerful
computational resources.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Research results and their discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Developed validation model</title>
        <p>
          An autoencoder was selected as a one-class classifier based on the study by Isuru Jayarathne and
Michael Cohen [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The model described in their work was adapted: instead of fully connected layers,
convolutional layers were used, which allowed for more efficient processing of large-sized images.
Additionally, moderate data augmentation was applied, which is described in detail in the section
“Balancing and Preprocessing of Input Data.”
        </p>
        <p>
The loss function was calculated using Mean Squared Error (MSE), which is better suited to
reconstructing multi-channel color images. This differs from the approach in the original paper [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
where Binary Cross Entropy Loss (BCELoss) was used, which is more appropriate for binary
classification, particularly in the case of grayscale MNIST images.
        </p>
        <p></p>
        <p>The Binary Cross Entropy Loss formula (see Formula 1):</p>
        <disp-formula><tex-math><![CDATA[\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right], \quad (1)]]></tex-math></disp-formula>
        <p>where N is the number of samples in the dataset, y<sub>i</sub> is the true label for the i-th sample
(y<sub>i</sub> = 1 for the positive class and y<sub>i</sub> = 0 for the negative class), and ŷ<sub>i</sub> is the predicted
probability that the i-th sample belongs to the positive class.</p>
        <p></p>
        <p>The Mean Squared Error formula (see Formula 2):
measured
,
where N is the number of samples in the dataset, x  is the pixel value of the original image, and
x̂   is the pixel value of the reconstructed image.</p>
        <p>To convert the autoencoder into a fully functional one-class classification tool, the PSNR (Peak
Signal-to-Noise Ratio, see Formula 3) method was selected, as it provides a more accurate assessment
of similarity between the input and reconstructed images:</p>
        <disp-formula><tex-math><![CDATA[\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}\right), \quad (3)]]></tex-math></disp-formula>
        <p>where PSNR is measured in decibels (dB), MAX denotes the maximum possible pixel value of the
image (e.g., 1.0 for normalized images or 255 for 8-bit images), and MSE is defined in Formula (2).</p>
        <p>Preliminary analysis of other metrics, cosine similarity and MSE (Mean Squared Error), showed
that they are not sensitive enough to small but important changes in the images, which is critical in
medical analysis tasks. The classification method involves transforming both the original and
reconstructed images into vectors and computing their similarity:</p>
        <list list-type="bullet">
          <list-item><p>If the PSNR value exceeds 25, the image is classified as correct (i.e., contains a mole).</p></list-item>
          <list-item><p>If the value is below the threshold, the image is classified as potentially anomalous or in
need of retaking.</p></list-item>
        </list>
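        <p>A minimal sketch of this decision rule, assuming a trained autoencoder module and image tensors
normalized to [0, 1]:</p>
        <preformat><![CDATA[
import math
import torch
import torch.nn.functional as F

def psnr_db(original, reconstructed, max_val=1.0):
    """PSNR (Formula 3); tensors assumed normalized to [0, 1]."""
    mse = F.mse_loss(reconstructed, original).item()
    return float("inf") if mse == 0 else 10 * math.log10(max_val ** 2 / mse)

def passes_validation(image, autoencoder, threshold=25.0):
    """Decision rule from the text: accept the image if the PSNR between
    the input and its reconstruction exceeds 25 dB."""
    with torch.no_grad():
        return psnr_db(image, autoencoder(image)) > threshold
]]></preformat>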
        <p>
          Additionally, synthetic data with artificially added moles was used for training. Several model
configurations were tested during training of the autoencoder by changing parameters such as latent
space size (latent_dim, the size of the compressed representation of the input data), learning rate,
and batch size. The platform Weights &amp; Biases [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] was used to monitor metrics and save model
weights.
        </p>
        <p>The table below (see Table 1) presents the training results for different model configurations; the
best results are marked in green, based on a gradient from worst (red) to best (green):</p>
        <p>[Table 1: training results for 16 autoencoder configurations, varying batch_size, learning_rate,
number of epochs, and latent_dim; the individual rows could not be recovered from the extracted
text.]</p>
        <p>
First configuration: batch_size = 32, learning rate = 0.0001, latent_dim = 256. Training lasted
for 17 epochs and was automatically stopped due to the absence of further improvement in
metrics. Obtained metrics: PSNR = 24.8586, loss = 0.0032. The analysis of the graphs indicated
no overfitting and stable model convergence (see Figure 12).</p>
        <p>Second configuration: batch_size = 32, learning rate = 0.001, latent_dim = 100. Training lasted
for 20 epochs. Obtained metrics: PSNR = 25.228, loss = 0.003. The graph analysis shows no
overfitting and stable model convergence (see Figure 13).</p>
        <p>Third configuration: batch_size = 32, learning rate = 0.0001, latent_dim = 200. Training ended
after 10 epochs. Metrics: PSNR = 25.07, loss = 0.0031 (see Figure 14).</p>
        <p>Analysis of the results shows that the reconstruction quality strongly depends on the latent space
size. Reducing the latent_dim to 32 significantly degrades accuracy, regardless of the learning rate.
Reducing the batch size slows down model convergence. As for the learning rate, 0.001 turned out
to be optimal: at 0.01, convergence was unstable, while further reduction to 0.0001 did not yield significant
improvement. The obtained results confirm the importance of proper hyperparameter tuning for
stable and efficient autoencoder training.</p>
        <p>To evaluate the algorithm's performance, visual examples are provided for both successfully
reconstructed images and those with noticeable reconstruction errors (see Figures 15–16).</p>
        <p>The developed model is used for the initial validation step of the service: the user uploads an
image, the autoencoder analyzes it, and returns a decision on whether a new image is needed or
whether the current image is of sufficient quality for further analysis.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Developed segmentation model</title>
        <p>After passing the validation check, the image is sent to the preliminary segmentation stage, for
which a neural network model based on Mask R-CNN was developed. The chosen model does not
require significant preprocessing of the data. Training was conducted in fine-tuning mode: most of
the pre-trained network weights were retained, except for the roi_heads module, which was adapted
to the specifics of the task.</p>
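        <p>The models in this study were implemented in PyTorch [27]; a common way to perform exactly this
kind of roi_heads replacement is TorchVision's detection API, sketched below. The two-class setup
(background plus mole) is inferred from the task, and the hidden width of 256 is the library's usual
default rather than a value reported in the paper:</p>
        <preformat><![CDATA[
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

# Load a COCO-pretrained Mask R-CNN; backbone and RPN weights are kept.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

num_classes = 2  # background + mole
# Replace the box classification/regression head inside roi_heads.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
# Replace the mask prediction head as well.
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256,
                                                   num_classes)
]]></preformat>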
        <p>Experiments were conducted with various hyperparameter configurations, including changes in
the learning rate, choice of optimizer, and number of epochs. Satisfactory results were achieved after
just five epochs of training. The quality of segmentation was evaluated on both the training and
validation datasets, confirming the effectiveness of the chosen approach.</p>
        <p>An example of the resulting mole segmentation is presented (see Figure 17):</p>
        <p>The model demonstrated high IoU (Intersection over Union, see Formula 4) values even after
minimal modifications and only five epochs of training (see Figure 18).</p>
        <disp-formula><tex-math><![CDATA[\mathrm{IoU} = \frac{\text{area of overlap}}{\text{area of union}}, \quad (4)]]></tex-math></disp-formula>
        <p>where area of overlap is the region where the predicted bounding box and the ground truth
bounding box intersect, and area of union is the total area covered by both the predicted and ground
truth bounding boxes combined.</p>
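        <p>The same formula applies directly to segmentation masks; a small NumPy sketch of computing
Formula (4) from binary masks:</p>
        <preformat><![CDATA[
import numpy as np

def iou(pred, target):
    """Formula (4) on binary masks: |intersection| / |union|."""
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    union = np.logical_or(pred, target).sum()
    # two empty masks count as perfect agreement by convention
    return 1.0 if union == 0 else np.logical_and(pred, target).sum() / union
]]></preformat>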
        <p>However, the main issue remains the duration of training: one epoch of Mask R-CNN takes at
least 40 minutes due to the large amount of training data, even on a high-performance GPU.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Developed classification models</title>
        <p>To compare the performance of different neural network architectures, various models were
trained on the same dataset. This process included testing different model variations, particularly by
tuning hyperparameters such as batch size and learning rate. However, not all models showed
significant improvements even after hyperparameter optimization. Some models continued to yield
poor results, indicating their insufficient effectiveness for this study regardless of configuration
adjustments.</p>
        <p>Model performance was evaluated using the Accuracy (see Formula 5), Precision (Formula 6),
Recall (Formula 7), and F1-score (Formula 8) metrics on both the training and validation datasets.
The following formulas were used to compute these metrics:</p>
        <disp-formula><tex-math><![CDATA[\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad (5)]]></tex-math></disp-formula>
        <disp-formula><tex-math><![CDATA[\mathrm{Precision} = \frac{TP}{TP + FP}, \quad (6)]]></tex-math></disp-formula>
        <disp-formula><tex-math><![CDATA[\mathrm{Recall} = \frac{TP}{TP + FN}, \quad (7)]]></tex-math></disp-formula>
        <disp-formula><tex-math><![CDATA[F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \quad (8)]]></tex-math></disp-formula>
        <p>where TP is the number of true positive predictions, TN is true negatives, FP is false positives,
and FN is false negatives.</p>
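        <p>These metrics can be computed with scikit-learn [28], which the study used for performance
evaluation; the macro averaging below is an assumption that weights all seven classes equally, and the
labels are illustrative:</p>
        <preformat><![CDATA[
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Illustrative labels; in practice y_true/y_pred come from the test loader.
y_true = [0, 1, 2, 2, 3, 4, 5, 6]
y_pred = [0, 1, 2, 1, 3, 4, 5, 6]

acc = accuracy_score(y_true, y_pred)                                      # (5)
prec = precision_score(y_true, y_pred, average="macro", zero_division=0)  # (6)
rec = recall_score(y_true, y_pred, average="macro", zero_division=0)      # (7)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)           # (8)
]]></preformat>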
        <p>The table below (see Table 2) presents the training results of models with various configurations;
the best results are highlighted in green, following a gradient from worst (red) to best (green).
[Table 2: training results for all classifier configurations; the individual rows could not be
recovered from the extracted text.]</p>
        <p>According to the research findings, the best results were demonstrated by the following models
(see Table 3):</p>
        <p>To assess the training effectiveness of the best-performing models, graphs showing changes in
the training loss function were constructed (see Figure 19).</p>
        <p>Among the tested architectures, ResNet-50 achieved the best classification performance,
significantly outperforming others in key metrics. The VGG and EfficientNet-B0 models also showed
strong results, although they slightly lagged behind ResNet-50.</p>
        <p>On the other hand, architectures such as Inception v3, SqueezeNet, AlexNet, and EfficientNet-B5
were less effective for the given task. In particular:</p>
        <list list-type="bullet">
          <list-item><p>Inception v3 achieved a validation F1-score of 0.690 and accuracy of 0.688;</p></list-item>
          <list-item><p>SqueezeNet achieved an F1-score of 0.596 and accuracy of 0.595;</p></list-item>
          <list-item><p>AlexNet showed an F1-score of 0.701 and accuracy of 0.710;</p></list-item>
          <list-item><p>EfficientNet-B5 significantly underperformed compared to the other models, with an
F1-score of just 0.105 and accuracy of 0.155.</p></list-item>
        </list>
        <p>This may be due to their architectural characteristics, insufficient adaptation to the specific task,
or suboptimal hyperparameter settings. Figure 20 shows the loss dynamics for all models and
training runs for comparison:</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In the course of this study, various scientific publications were reviewed regarding the application
of neural networks in computer vision, particularly in dermatology. Several deep learning models
were tested for the classification and segmentation of skin images, and the most effective approaches
were identified based on empirical evaluation.</p>
      <p>For the skin lesion classification task, the custom model based on ResNet-50 achieved the best
results, with a validation accuracy of 88.6% and an F1-score of 0.886, outperforming the EfficientNet-B0,
EfficientNet-B5, SqueezeNet, AlexNet, Inception v3, and VGG-based models tested under similar
conditions. This performance indicates a strong generalization capability even with a relatively
limited number of training epochs.</p>
      <p>For segmentation, the Mask R-CNN-based model demonstrated high reliability, achieving an
Intersection over Union (IoU) of 87%. It effectively detects all visible moles in an image, regardless of
size or location. This model is particularly valuable for distinguishing individual lesions from the
background, enabling the precise extraction of each mole for subsequent classification. Although
computationally intensive, its high precision in localized analysis supports its use for detailed lesion
identification and individual mole classification.</p>
      <p>An autoencoder model was also employed for preliminary mole presence detection, effectively
filtering out irrelevant images and contributing to overall pipeline efficiency. All the tested models
are planned to be integrated into a unified AI-based module for automated skin lesion analysis. The
system will include stages of initial image validation, mole segmentation, and lesion classification.</p>
      <p>Future work will focus on developing a clinical prototype capable of delivering actionable
recommendations. This integrated approach has the potential to significantly enhance early
detection of skin conditions, streamline dermatological workflows, and improve patient outcomes.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 to assist with grammar and spelling
check, paraphrasing and rewording, and improving the writing style. After using this tool, the
authors reviewed and edited the content as needed and take full responsibility for the publication’s
content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>World</given-names>
            <surname>Health</surname>
          </string-name>
          <article-title>Organization (WHO). Skin cancer factsheet</article-title>
          . https://www.iarc.who.int/cancertype/skin-cancer/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Kaggle</surname>
          </string-name>
          .
          <article-title>Skin Cancer : HAM10000 [dataset]</article-title>
          . https://www.kaggle.com/datasets/surajghuwalewala/ham1000-segmentation
          <string-name>
            <surname>-</surname>
          </string-name>
          and-classification
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Perera</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oza</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Patel</surname>
            , V. 
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>One-Class Classification: A Survey</article-title>
          .
          <source>arXiv preprint arXiv:2101.03064</source>
          . https://arxiv.org/abs/2101.03064
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Weights</surname>
            <given-names> </given-names>
          </string-name>
          &amp; 
          <string-name>
            <surname>Biases</surname>
          </string-name>
          . Documentation. https://docs.wandb.ai/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Tschandl</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosendahl</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kittler</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions</article-title>
          .
          <source>Scientific Data</source>
          , 
          <volume>5</volume>
          , 180161. https://doi.org/10.1038/sdata.2018.161
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Siddique</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , et al. (
          <year>2023</year>
          ).
          <source>Comparison of VGG-16</source>
          , VGG-
          <volume>19</volume>
          , and
          <article-title>ResNet-101 CNN Models for Suspicious Activity Detection</article-title>
          .
          <source>International Journal of Scientific Research in Computer Science</source>
          , Engineering and Information Technology, 
          <volume>8</volume>
          (
          <issue>1</issue>
          ). https://doi.org/10.32628/CSEIT2390124
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Jayarathne</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Autoencoder-based One-class Classification</article-title>
          .
          <source>326th SICE Tohoku Branch Workshop</source>
          . https://www.researchgate.net/publication/337929447_Autoencoder-based_Oneclass_Classification.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Chawla</surname>
            ,
            <given-names>N. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowyer</surname>
          </string-name>
          , K. W.,
          <string-name>
            <surname>Hall</surname>
            , L. 
            <given-names>O.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kegelmeyer</surname>
            ,
            <given-names>W. P.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>SMOTE: Synthetic Minority Over-sampling Technique</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          , 
          <volume>16</volume>
          ,
          <fpage>321</fpage>
          -
          <lpage>357</lpage>
          . https://doi.org/10.1613/jair.953
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Naqvi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gilani</surname>
            ,
            <given-names>S. Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Syed</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marques</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.-C.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Skin Cancer Detection Using Deep Learning - A Review</article-title>
          .
          <source>Diagnostics</source>
          , 
          <volume>13</volume>
          (
          <issue>11</issue>
          ), 1911. https://doi.org/10.3390/diagnostics13111911
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Hafiz</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Bhat</surname>
            ,
            <given-names>G. M.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>A Survey on Instance Segmentation: State of the Art</article-title>
          . arXiv preprint arXiv:2007.00047. https://arxiv.org/abs/2007.00047
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sreya</surname>
            , &amp; Vinod Kumar,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>A Review on Instance Segmentation Using Mask R-CNN</article-title>
          .
          <source>Proceedings of the International Conference on Systems, Energy &amp; Environment (ICSEE)</source>
           
          <year>2021</year>
          . SSRN. https://ssrn.com/abstract=3794272
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          .
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . https://doi.org/10.1109/CVPR.2016.90
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G. E. (
          <year>2012</year>
          ).
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          , 
          <volume>25</volume>
          . https://doi.org/10.1145/3065386
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Iandola</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
           N.,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moskewicz</surname>
          </string-name>
          , M. W.,
          <string-name>
            <surname>Ashraf</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dally</surname>
          </string-name>
          , W. J., &amp;
          <string-name>
            <surname>Keutzer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and &lt;0.5 MB model size</article-title>
          .
          <source>arXiv preprint arXiv:1602</source>
          .07360. https://arxiv.org/abs/1602.07360
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , et al. (
          <year>2015</year>
          ).
          <article-title>Going deeper with convolutions</article-title>
          .
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . https://doi.org/10.1109/CVPR.2015.7298594
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks</article-title>
          . arXiv:1905.11946. https://arxiv.org/abs/1905.11946
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv:1409</source>
          .1556. https://arxiv.org/abs/1409.1556
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , van der Maaten,
          <string-name>
            <given-names>L.</given-names>
            , &amp;
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. Q.</surname>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Densely Connected Convolutional Networks</article-title>
          .
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <fpage>2261</fpage>
          -
          <lpage>2269</lpage>
          . https://doi.org/10.1109/CVPR.2017.243
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
           G., et al. (
          <year>2017</year>
          ).
          <article-title>MobileNets: Efficient convolutional neural networks for mobile vision applications</article-title>
          . arXiv:1704.04861. https://arxiv.org/abs/1704.04861
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>