Synthesis of biomedical images based on generative intelligence tools

Oleh Berezsky, Petro Liashchynskyi, Grygoriy Melnyk, Maksym Dombrovskyi and Mykola Berezkyi
West Ukrainian National University, 11 Lvivska st., Ternopil, 46001, Ukraine

Abstract
The paper substantiates the use of generative intelligence tools for generating biomedical images. The literature on methods and techniques for generating images with GANs and diffusion models is analyzed. A new GAN architecture and an algorithm for synthesizing cytological images based on a diffusion model have been developed. Established datasets used for training deep neural networks are examined, along with the widely recognized metrics for evaluating the quality of synthetic images: IS and FID. Computer experiments on the synthesis of cytological images were conducted using GAN and Stable Diffusion. The following results were obtained: diffusion model: FID = 0.63, IS = 3.99; GAN: FID = 3.39, IS = 3.95.

Keywords
cytological images, generative intelligence, image generation, generative adversarial networks, data sets, diffusion model, IS metric, FID metric

1. Introduction

Generative intelligence has become the pinnacle of research in artificial intelligence. Generative intelligence systems make it possible to generate texts, images, sounds, etc. They are based on deep neural network models trained on large samples of data. Consequently, a variety of generative intelligence systems have emerged that transform text into image, image into image, image into text, sound into text, text into sound, and sound into sound. Text-to-image transformation takes place on a fixed set of data. For this purpose, a transformer is used that autoregressively models text and image tokens [1]. The Codex GPT language model, trained on GitHub code, makes it possible to write code in Python [2].
The paper [3] analyzes the opportunities and risks of foundation models for language, vision, and reasoning. In addition, it examines their technical principles: model architectures, learning algorithms, and data. The impact of generative intelligence on society has also been studied. The paper [4] investigates the LaMDA family of neural language models for dialogue applications. The model generates responses based on learning from known sources. The authors investigated the LaMDA system in education. Generative intelligence has also found applications in medicine. This paper investigates the use of generative intelligence in oncology, in particular for generating cytological images of breast cancer. Breast cancer is one of the most common cancers among women worldwide. Early diagnosis and accurate determination of the stage of disease development are key factors for successful treatment and reduction of mortality. Cytological, histological and immunohistochemical images are used to detect pathologies. These images are a class of biomedical images. Cytological analysis of images of cell preparations is one of the diagnostic methods that allows the detection of pathological changes at the cellular level [5].

IDDM'24: 7th International Conference on Informatics & Data-Driven Medicine, November 14-16, 2024, Birmingham, UK
∗ Corresponding author. † These authors contributed equally.
ob@wunu.edu.ua (O. Berezsky); p.liashchynskyi@gmail.com (P. Liashchynskyi); mgm@wunu.edu.ua (G. Melnyk); m20052002@gmail.com (M. Dombrovskyi); mykolaberezkyy@gmail.com (M. Berezkyi)
ORCID: 0000-0001-9931-4154 (O. Berezsky); 0000-0002-3920-6239 (P. Liashchynskyi); 0000-0003-0646-7448 (G. Melnyk); 0009-0008-2287-416X (M. Dombrovskyi); 0000-0001-6507-9117 (M. Berezkyi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
To train automatic systems for diagnosing breast cancer, large and high-quality datasets are needed that reflect the variety of possible pathological changes. Datasets of cytological images of breast cancer have the following features:
• Diversity of cell structures: normal cells, different types of atypical and malignant cells.
• Image variability: changes in color, lighting, focus, etc.
• Annotations and markup: availability of expert markup for supervised learning.
The available datasets of real images are limited and poorly annotated. Therefore, the generation of biomedical images in oncology is an urgent problem. Solving it provides the accuracy needed for the classification of biomedical images. To this end, the paper uses generative intelligence tools: GANs and diffusion models.

2. Literature review

Researchers have developed a number of approaches to the problem of generating biomedical images. In particular, the article [6] discusses the problems of creating medically significant fine-grained images of pulmonary adenocarcinomas using Stable Diffusion models. The authors show how these models can be used to generate images from a limited number of samples, which is important for medical research, where data can be scarce. Other papers present an analysis of diffusion models in medical imaging [7]. The authors consider modern methods and approaches to processing medical images using deep learning, in particular diffusion models, which can significantly improve the quality of diagnostics. The paper [8] presents a novel generative model that uses Langevin dynamics to generate samples by estimating the gradients of the data distribution after adding Gaussian noise. This avoids problems with low-dimensional manifolds and improves sample quality. The paper [9] explores how computer vision models trained on large sets of images from the Internet automatically learn human social biases, such as racism and sexism.
This question becomes important in the context of the ethical use of generative models. The article [10] describes the process of synthetic data generation in digital pathology using diffusion models. The authors present a comprehensive approach to assessing the quality of the generated images, which can be useful for educational purposes. An article by A. Radford, J.W. Kim, C. Hallacy and co-authors describes the CLIP model, which is trained on large datasets of images and texts to perform a variety of computer vision tasks without special training for each task [11]. The model demonstrates zero-shot transfer to many datasets, which opens up new possibilities for application. The authors R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer describe latent diffusion models for high-resolution image generation [12]. They use autoencoders to reduce the dimensionality of the data, which reduces computational costs without losing image quality. The article [13] presents the Imagen model, a text-to-image diffusion model with a high level of photorealism. The model uses large language models to encode the text, which greatly improves the quality of the samples. A paper by other researchers describes the use of diffusion probabilistic models for the synthesis of histopathological images, which is important for pathology research [14]. The paper [15] presents diffusion probabilistic models used to generate high-quality images. These models produce high-quality samples on various datasets, such as CIFAR10 and LSUN. Thus, the analysis of literature sources indicates significant progress in the development of image synthesis methods, in particular through the use of diffusion models and GANs. This opens up new possibilities for improving the quality and diversity of synthesized images in medical imaging. In the paper [16], researchers consider a deep learning approach based on non-equilibrium thermodynamics.
They present diffusion probabilistic models that gradually destroy structure in the data through a diffusion process and then train the reverse process to reconstruct the structure, creating a flexible and computationally efficient generative model. In the paper [17], the authors investigate diffusion models that are superior to generative adversarial networks (GANs) in image synthesis tasks. They demonstrate that diffusion models can achieve high-quality image samples, surpassing current generative models. The paper [18] presents the use of cascaded diffusion models to generate high-quality images. A cascaded diffusion model consists of several stages, where each subsequent stage increases the resolution of the image. The authors T. Karras, S. Laine, and T. Aila describe a new generator architecture for generative adversarial networks (GANs) that borrows ideas from style transfer [19]. This architecture allows for automatic, unsupervised separation of high-level attributes from stochastic variations in generated images. The paper [20] describes a new approach to variational autoencoders (VAEs) for image generation. The NVAE network uses depthwise separable convolutions and batch normalization to improve the quality of generated images. The paper [21] describes a novel approach to generative modeling that uses stochastic differential equations (SDEs) to transform a data distribution into a simple noise distribution and vice versa. The model achieves high results in image generation and demonstrates capabilities for solving inverse problems. The authors of another paper [22] developed a method for image inpainting using denoising diffusion probabilistic models (DDPM). Based on this method, diverse and semantically meaningful images can be generated, surpassing current GAN-based methods. In [23], the authors describe improvements to denoising diffusion probabilistic models for image generation.
They use precision and recall metrics to compare images. Experiments have shown that diffusion models achieve higher recall at similar values of the FID metric. The authors of another publication [24] developed an algorithm for stochastic variational Bayesian inference. This approach makes it possible to train model parameters without using iterative inference schemes. The authors of this publication have been analyzing biomedical images for over twenty years under the guidance of Professor Oleh Berezsky. A number of publications reflect methods, algorithms, and software tools for analyzing cytological, histological and immunohistochemical images [25-31]. This is the result of a creative collaboration of researchers from West Ukrainian National University and Ivan Horbachevsky Ternopil National Medical University.

3. Problem statement

Given: a set of real cytological images I_C. Image synthesis is carried out on the basis of GANs and networks built on diffusion models (DMN). Generating with a GAN yields a set of images I_CG; generating with a DMN yields a set of images I_CD. In addition, two metrics are given: IS and FID. It is necessary to find the distances M_IS and M_FID between the set of real cytological images I_C and the sets of synthetic images I_CG and I_CD using the IS and FID metrics, i.e.:
1. M_IS(I_C, I_CG) and M_FID(I_C, I_CG);
2. M_IS(I_C, I_CD) and M_FID(I_C, I_CD).
Compare:
3. M_FID(I_C, I_CD) and M_FID(I_C, I_CG);
4. M_IS(I_C, I_CD) and M_IS(I_C, I_CG).

4. Analysis of image datasets

When creating datasets of cytological images, it is important to standardize the annotation, as it ensures high quality, reliability and compatibility of data for further use in machine learning and diagnostic processes. In addition, proper annotation increases the efficiency of training AI models, as well-defined labels reduce error rates in the learning process and help algorithms better recognize cell features and pathological changes.
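Standardized annotation can be illustrated with a minimal record in the widely used COCO JSON style. This is only a sketch: the file name, class names, and box coordinates below are invented for illustration, and only the field layout follows the public COCO format.

```python
import json

# Minimal, hypothetical annotation record in the COCO JSON style.
# File name, class labels and box coordinates are invented for
# illustration; only the field layout follows the COCO format.
annotation = {
    "images": [
        {"id": 1, "file_name": "cyto_0001.png", "width": 2048, "height": 1532}
    ],
    "categories": [
        {"id": 1, "name": "normal_cell"},
        {"id": 2, "name": "atypical_cell"},
    ],
    "annotations": [
        {   # one labeled cell; bbox is [x, y, width, height] in pixels
            "id": 1,
            "image_id": 1,
            "category_id": 2,
            "bbox": [412, 305, 64, 58],
        }
    ],
}

serialized = json.dumps(annotation, indent=2)
print(len(serialized) > 0)
```

A record of this shape keeps image metadata, class definitions, and per-cell labels in one machine-readable file, which is what makes datasets interchangeable between training pipelines.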
When segmenting and annotating objects in cytological images, it is important to adhere to the image annotation formats used in the PASCAL VOC [32] and COCO [33] datasets. The APCData dataset [34] consists of cytological images of the cervix, developed in collaboration with a laboratory of anatomical pathology and cytology located in Rivera, Uruguay. The set includes 425 images divided into 6 classes. The cells are labeled using bounding boxes and the centers of the nuclei. The dataset consists of 425 images of 2048×1532 pixels, corresponding to 73 patients diagnosed using the Papanicolaou test. A total of 3619 cells were annotated. The images were taken using an Olympus CX40RF100 microscope and an Olympus LC30 camera, and processed using Olympus L.Cmicro software. Bounding boxes were created for cells in a format suitable for use with the YOLO convolutional neural network architecture. The UFSC OCPap dataset [35] contains 9797 annotated images of 1200×1600 pixels, obtained from 5 slides with diagnosed oral tissue cancer and 3 healthy samples. The slides were provided by the Hospital Dental Center of the University Hospital of the Federal University of Santa Catarina. The dataset contains binary nucleus masks and cell annotations in JSON format. The images are divided into training, validation, and testing subsets. The images were taken using an Axio Scan.Z1 microscope and a Hitachi HV-F202SCL camera. Dataset images are derived from virtual slides measuring 214,000 × 161,000 pixels (0.111 μm × 0.111 μm per pixel). For annotation, medical specialists used the LabelMe and LabelBox tools. The authors have developed a database of cytological images of breast cancer [36]. The images were obtained using a laboratory setup that includes a Delta Optical microscope and a Tucsen digital CMOS camera with a resolution of 8 megapixels.
The sources of microscope slides and diagnostic information were provided by the Department of Pathological Anatomy with the Sectional Course of Forensic Medicine of the Ternopil National Medical University. The database consists of 14 related tables. The table of studies includes basic information about each study: its title, the object of the study, and references to the patient and doctor associated with the study. All images of cytological samples are divided into 4 classes. The database supports several user roles: physician, expert, administrator. The database contains information about the segmentation algorithm used. For each cell, the following features are stored: area, perimeter, contour height, contour width, contour circularity, center coordinates, major axis of inertia, minor axis of inertia, angle of inclination of the major axis, Feret diameter, coordinates of the bounding rectangle, roundness, compactness.

5. GAN-Based Artificial Image Synthesis

The architecture of modern GANs consists of a generator and a discriminator [37]. The generator and discriminator architectures are based on cells. A cell consists of nodes that perform an addition operation and of operations between them. The following operations are used in the generator cell: convolution with 1×1, 3×3, and 5×5 kernels; separable convolution with a 3×3 kernel; zero; skip connection. The cell architecture remains the same for the entire generator model. In contrast to the generator, the set of operations in the discriminator cell is extended by two operations: max pooling with a 3×3 kernel and average pooling with a 3×3 kernel. The architecture of the generator is shown in Figure 1 and described in Table 1 and Table 2.

Figure 1: Generator architecture
Table 1
Generator architecture

Layer | Options | Output form
L1: Input | Gaussian noise | 1×128
L2: Transposed Conv + ELU activation | Kernel = 4, stride = 1, padding = 0 | 4×4×1024
L3: CELLG | Nodes = 4 | 4×4×1024
L4: L2 + L3 | | 4×4×1024
L5: Upsample | Scale = 2 | 8×8×1024
L6: CELLG | Nodes = 4 | 8×8×1024
L7: L5 + L6 | | 8×8×1024
L8: Upsample | Scale = 2 | 16×16×512
L9: CELLG | Nodes = 4 | 16×16×512
L10: Self Attention | Input channels = 512 | 16×16×512
L11: L8 + L10 + L9 | | 16×16×512
L12: Upsample | Scale = 2 | 32×32×256
L13: CELLG | Nodes = 4 | 32×32×256
L14: Self Attention | Input channels = 256 | 32×32×256
L15: L12 + L14 + L13 | | 32×32×256
L16: Upsample | Scale = 2 | 64×64×128
L17: Convolution | Kernel = 3, stride = 1, padding = 1 | 64×64×128
L18: Convolution | Kernel = 3, stride = 1, padding = 1 | 64×64×3
L19: Output | | 64×64×3

The discriminator architecture is shown in Figure 2 and described in Table 3 and Table 4. The generator takes as input a 1×128 noise vector from a Gaussian distribution and outputs a 64×64×3 image. The numbers of nodes in the generator and discriminator cells are 4 and 5, respectively. There are two skip connection operations in the generator cell and three in the discriminator cell. There is also a zero operation in the discriminator cell, which is not present in the generator. The Self-Attention operation is applied twice in both the generator and the discriminator. However, in the generator this operation is placed towards the end of the network, whereas in the discriminator it is closer to the beginning.
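As a rough illustration, a generator cell of the kind described above (Conv → ELU → Batch Norm nodes joined by addition-based skip connections) can be sketched in PyTorch. This is a simplified sketch, not the authors' exact implementation: the channel count and the 3×3 projection convolutions used in the final skip connections are assumptions.

```python
import torch
import torch.nn as nn

class CellG(nn.Module):
    """Sketch of a generator cell: convolutional nodes whose outputs
    are combined through addition-based skip connections."""

    def __init__(self, channels: int):
        super().__init__()
        # Node 1: Conv 3x3 -> ELU -> Batch Norm
        self.node1 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.ELU(),
            nn.BatchNorm2d(channels),
        )
        # Node 2 branch: Conv 3x3 -> Conv 1x1 -> ELU -> Batch Norm
        self.node2 = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.Conv2d(channels, channels, kernel_size=1, stride=1, padding=0),
            nn.ELU(),
            nn.BatchNorm2d(channels),
        )
        # Assumed 3x3 projection convolutions for the final skip connections
        self.proj1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.proj0 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        n1 = self.node1(x)                           # node L1
        n2 = n1 + self.node2(n1)                     # node L2 = L1 + conv branch
        return n2 + self.proj1(n1) + self.proj0(x)   # L3 = L2 + Conv(L1) + Conv(L0)

x = torch.randn(2, 64, 16, 16)       # batch of 16x16 feature maps, 64 channels
out = CellG(64)(x)
print(out.shape)                     # torch.Size([2, 64, 16, 16])
```

All operations preserve the spatial resolution, so the cell can be repeated at every scale of the generator between Upsample blocks.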
Table 2
Generator CELLG cell and Upsample block structure

CELLG cell structure
L0: Input
L1: Conv → ELU → Batch Norm | Kernel = 3, stride = 1, padding = 1
L2: L1 + (Conv 3×3 → Conv 1×1 → ELU → Batch Norm) | Conv 3×3: kernel = 3, stride = 1, padding = 1; Conv 1×1: kernel = 1, stride = 1, padding = 0
L3: L2 + Conv(L1) + Conv(L0) | Kernel = 3, stride = 1, padding = 1

Upsample block structure
L0: Input | | H×W×C
L1: Upsample | Scale = 2, mode = nearest | (H·2)×(W·2)×C
L2: Convolution | Kernel = 3, stride = 1, padding = 1 | (H·2)×(W·2)×C
L3: Conditional Batch Norm | Number of classes = 4 | (H·2)×(W·2)×C
L4: Gated Linear Unit (GLU) | Dimension = 1 | (H·2)×(W·2)×(C/2)

Figure 2: Discriminator architecture

6. Image Synthesis Based on Diffusion Model

Images based on the diffusion model are generated in the Stable Diffusion software environment. The basic Stable Diffusion model is trained on a large dataset of images. Training on a custom dataset takes place in the Hypernetwork neural network environment. This network adjusts the weights of the base model. The algorithm for generating images based on the diffusion model consists of the following steps:
1. training on a custom dataset of images in the Hypernetwork environment;
2. the process of adding noise to the initial dataset I_C;
3. the noise reduction process.
Let us detail the steps. The initial dataset is transformed into a latent space: I_C → Z_0^C. Based on Z_0^C, the noise value at each step t is calculated as follows:
Z_t = √α_t · Z_0^C + √(1 − α_t) · ε_t,
where α_t is the coefficient that determines the noise rate at step t. The value of step t is selected from the range t ∈ [0, T], where T is the number of steps; ε_t is the value of random Gaussian noise at step t, sampled as ε_t ~ N(E, D), where N is a normal distribution with expected value E = 0 and variance D = 1.
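The noising and noise-reduction steps of the algorithm above can be sketched numerically. The sketch below follows the standard DDPM formulation, so the linear β schedule and the distinction between the per-step coefficient α_t and its cumulative product ᾱ_t are assumptions; the true noise ε_t stands in for the estimate that a trained network would predict.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear noise schedule β_t
alphas = 1.0 - betas                 # per-step coefficients α_t
alpha_bars = np.cumprod(alphas)      # cumulative coefficients ᾱ_t

z0 = rng.standard_normal((64, 64))   # stand-in for a latent representation Z_0^C

def noise_step(z0, t):
    """Forward process: Z_t = sqrt(ᾱ_t) * Z_0^C + sqrt(1 - ᾱ_t) * ε_t."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return zt, eps

def denoise_step(zt, eps_hat, t):
    """One reverse step:
    Z_{t-1} = (Z_t - β_t / sqrt(1 - ᾱ_t) * ε_t) / sqrt(α_t)."""
    return (zt - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) \
        / np.sqrt(alphas[t])

zt, eps = noise_step(z0, t=500)
z_prev = denoise_step(zt, eps, t=500)   # true noise used as the estimate here
print(zt.shape, z_prev.shape)
```

In a real sampler, `eps_hat` would come from the trained denoising network at every step, and the reverse loop would run from t = T down to 0.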
Table 3
Discriminator architecture

Layer | Options | Output form
L1: Input | Image | 64×64×3
L2: Conv + ELU activation | Kernel = 3, stride = 1, padding = 1 | 64×64×64
L3: CELLD | Nodes = 5 | 64×64×64
L4: Self Attention | Input channels = 64 | 64×64×64
L5: L2 + L4 + L3 | | 64×64×64
L6: Downsample | Scale = 2 | 32×32×128
L7: CELLD | Nodes = 5 | 32×32×128
L8: Self Attention | Input channels = 64 | 32×32×128
L9: L6 + L8 + L7 | | 32×32×128
L10: Downsample | Scale = 2 | 16×16×256
L11: CELLD | Nodes = 5 | 16×16×256
L12: L10 + L11 | | 16×16×256
L13: Downsample | Scale = 2 | 8×8×512
L14: CELLD | Nodes = 5 | 8×8×512
L15: L13 + L14 | | 8×8×512
L16: Downsample | Scale = 2 | 4×4×1024
L17: Linear(Sum(L16)) | | 1×1
L18: Sum(Multiply(Sum(L16), Embedding)) | Number of classes = 4 | 1×1
L19: L17 + L18 | | 1×1
L20: Output | | 1×1

The noise reduction value is calculated according to the expression:
Z_{t−1} = (1 / √α_t) · (Z_t − (β_t / √(1 − α_t)) · ε_t),
where ε_t is the estimated noise value at step t; α_t is the coefficient that determines the noise level at step t; β_t is a coefficient that controls the level of noise reduction. After performing the noise reduction process (after traversing t = T steps), a vector Z_1^C is formed in the latent space. The decoder then transforms Z_1^C into a set of images I_CD, with |I_CD| ≫ |I_C|. The quality of the generated images is checked using the IS and FID metrics.

7. Metrics for Synthesized Image Evaluation

Two main metrics are used to assess the quality of synthesized images: the IS metric and the FID metric. The IS metric is based on the Google Inception V3 neural network model for color image classification. This metric was tested on the ImageNet dataset, which contains 1.2 million RGB images divided into 1000 classes.
The analytic expression for the metric is as follows:
IS(G) ≈ exp(E_{x~p_g}[D_KL(p(y|x) || p(y))]),
where E is the expected value; x ~ p_g indicates that x is an image sampled from the generator distribution p_g; D_KL is the Kullback-Leibler divergence between the conditional probability distribution p(y|x) and the marginal distribution p(y) [38].

Table 4
Discriminator CELLD cell and Downsample block structure

CELLD cell structure
L0: Input
L1: Conv → ELU → Batch Norm | Kernel = 3, stride = 1, padding = 1
L2: L1 + (Conv 3×3 → Conv 1×1 → ELU → Batch Norm) | Conv 3×3: kernel = 3, stride = 1, padding = 1; Conv 1×1: kernel = 1, stride = 1, padding = 0
L3: AvgPool 3×3 (L2) | Kernel = 3, stride = 1
L4: L0 + L3 + AvgPool 3×3 (L2) | Kernel = 3, stride = 1

Downsample block structure
L0: Input | | H×W×C
L2: Convolution | Kernel = 3, stride = 1, padding = 1 | H×W×(C·2)
L3: Pixel Rearrange → Convolution | Kernel = 1, stride = 1, padding = 0 | (H/2)×(W/2)×(C·2)
L4: ELU | | (H/2)×(W/2)×(C·2)

The IS metric measures the average Kullback-Leibler divergence between the conditional distribution p(y|x) and the marginal class distribution p(y). The minimum value of the metric is 1, and the maximum value is the number of classes. The FID metric compares the distributions of original and synthetic data. Based on this metric, the distance between images is calculated as follows:
d²((m_r, C_r), (m_g, C_g)) = ‖m_r − m_g‖² + Tr(C_r + C_g − 2(C_r C_g)^{1/2}),
where (m_r, C_r) and (m_g, C_g) are the mean and covariance of the real and synthesized data distributions, respectively, and Tr is the trace (the sum of the diagonal elements) of a matrix. The smaller the value of the metric, the smaller the distance between the distributions, that is, the more similar the images are to each other [39]. The FID metric is sensitive to distortions in images (shift, noise, etc.).

8. Computer experiments

Computer experiments on the synthesis of cytological images were carried out using GAN and Stable Diffusion.
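The FID distance described in Section 7 can be sketched as follows. This is a minimal sketch, not a full evaluation pipeline: SciPy's `sqrtm` stands in for the matrix square root (C_r·C_g)^{1/2}, and random vectors play the role of Inception V3 feature embeddings.

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    """Frechet distance between Gaussian fits of two feature sets:
    ||m_r - m_g||^2 + Tr(C_r + C_g - 2 * (C_r C_g)^(1/2))."""
    m_r, m_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    c_r = np.cov(feats_real, rowvar=False)
    c_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(c_r @ c_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real       # drop tiny imaginary parts from sqrtm
    return float(np.sum((m_r - m_g) ** 2)
                 + np.trace(c_r + c_g - 2.0 * covmean))

rng = np.random.default_rng(1)
a = rng.standard_normal((500, 8))    # placeholder "real" feature vectors
b = rng.standard_normal((500, 8))    # placeholder "generated" feature vectors
print(abs(fid(a, a)) < 1e-6)         # identical sets: distance ~0 -> True
print(fid(a, b) > 0)                 # distinct samples: positive -> True
```

In practice, the feature vectors are taken from a late layer of Inception V3 applied to the real and synthesized images, which is what makes the reported FID values comparable across methods.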
To conduct the computational experiments, a training set of cytological images published on the Zenodo platform [40] was used.

8.1. Computer experiments with GAN

Images from the training dataset were downscaled to a resolution of 64×64 pixels (the original resolution is 3264×2448). The initial number of images is around 100, which is not enough. Therefore, the dataset was expanded to 800 images by applying affine transformations. This technique also balanced the dataset: it contains the same number of images (200) for each class. To extend the training dataset, the authors' own library [41] was used with default parameters. Images are randomly rotated, flipped, and scaled; all operations were applied with a probability of 50%.
Hardware. The Python programming language and the PyTorch framework were used to write the code. A virtual machine with the following configuration was used for the experiments: 16 GB RAM, 10 vCPU × 2.2 GHz, Nvidia Tesla V100 GPU 16 GB (13.2 TFLOPS).
Training options. In the experiments, Hinge Loss was used as the loss function together with the Adam optimizer (betas = 0.5, 0.999). The Two Time-scale Update Rule (TTUR) technique was also used, which involves different learning rates for the generator and the discriminator: the learning rate of the generator is 0.0001, and that of the discriminator is 0.0004. For all convolutional, deconvolutional, and linear layers in both models, spectral normalization was applied, which stabilizes the learning process. Batch size: 128; number of iterations: 100,000; training time: ~13.6 GPU hours.
Experiment results. The FID metric value is 3.39 (Class 1 – 3.42, Class 2 – 3.42, Class 3 – 3.35, Class 4 – 3.37), and the IS metric value is 3.95. Examples of synthesized images are shown in Figure 3.

8.2. Computer experiments in the Stable Diffusion environment

Stable Diffusion is a powerful AI model for generating images from text prompts that operates in a compressed latent space.
The main features of Stable Diffusion are as follows:
1. model type: latent diffusion text-to-image model;
2. training: the "laion-aesthetics v2 5+" dataset;
3. architecture: encoder, CLIP ViT-L/14 text encoder, UNet core model with cross-attention;
4. optimization: AdamW, 32 × 8 × A100 GPU.
Training options. To train the model, the Linear loss function and the Adam optimizer were used. Layers of size 768, 1024, 320, 640, and 1280 with linear activation and Normal weight initialization were chosen as the hypernetwork structure. Batch size was set to 1 and Gradient Accumulation Steps to 1. Gradient clipping with a value of 0.1 was used to stabilize learning. The training took place with a hypernetwork learning rate of 0.00001. The total number of iterations was 20,000 steps, and the size of the images was fixed at 512×512 pixels. The training was carried out using text prompts based on a style_filewords.txt template. Intermediate image results were saved to the log directory every 100 steps.
Hardware. For the experiments, infrastructure from Jarvis Labs was used with the following computing resources:
1. GPU: 1 × A6000 Ampere (CUDA 12.3);
2. Processors: 7 CPUs;
3. RAM: 32 GB;
4. Video memory: 48 GB VRAM;
5. Linux system version: 22.04.
This configuration provides high performance for creating AI-generated images, making it possible to use the capabilities of the Stable Diffusion model effectively to generate high-quality results.
Experiment results. The FID metric value is 0.63 (class 1 – 0.54, class 2 – 0.6, class 3 – 0.7, class 4 – 0.68). The value of the IS metric is 3.99. Examples of real images are shown in Figure 4, and examples of synthetic images in Figure 5.

Figure 3: Examples of synthesized images
Figure 4: Examples of real images
Figure 5: Examples of synthetic images

9. Discussions

Let us analyze the conducted computer experiments using GAN and Stable Diffusion.
The results of comparing the quality of synthesized cytological images using the developed GAN architecture and other known architectures are given in Table 5.

Table 5
Results of comparison with other GAN architectures

Method | FID
DCGAN | 12.67
WGAN | 12.72
WGAN-GP | 19.09
BGAN | 10.03
BEGAN | 15.32
Developed architecture | 3.39

Consequently, the developed GAN architecture provides better results in terms of the FID metric than other well-known architectures. Let us analyze the advantages and disadvantages of generating images based on GANs and on diffusion models.
The advantages of GANs are as follows:
1. The ability to generate high-quality, realistic images, video, and audio.
2. The ability to control the synthesis process (from the smallest details to common features in the image).
3. Relatively high speed of image synthesis: an image is synthesized in one forward pass of the neural network.
The disadvantages of GANs are as follows:
1. Significant computing resources and the need for expertise to train effectively, making them less accessible.
2. Mode collapse, where the generator begins to produce a limited number of images, which reduces the variety of synthetic images.
3. The learning process is complex and long, because a GAN consists of two neural networks competing with each other.
The advantages of diffusion models are as follows:
1. The ability to produce high-quality images that often surpass GANs in terms of realism and variety.
2. The ability to work with complex data distributions, which makes diffusion models universal for different areas.
3. A simpler learning process compared to GANs, which avoids the problem of mode collapse.
The disadvantages of diffusion models are as follows:
1. Significant computing resources for training and generation, which may limit availability.
2. Data generation through an iterative process is quite resource-intensive compared to the single forward pass used by GANs.
Diffusion models transform a noise distribution into the data distribution through a diffusion process, gradually improving the generated image. This process provides a high degree of control over generation, as the model can be stopped at any point to obtain different levels of detail. GANs, in contrast, generate data in a single step: the generator creates the image and the discriminator evaluates it. Although this process is faster, it can lead to mode collapse, where the generator produces a limited number of images. Consequently, a GAN is built on the concept of competition between a generator and a discriminator to create realistic images, while diffusion models transform noise into images through an iterative diffusion (denoising) process. Diffusion models require careful tuning of hyperparameters and longer training times. In addition, both approaches require a large amount of training data to perform optimally.

10. Conclusions

As a result, tools for synthesizing cytological images have been developed and compared in this work. The following results were obtained:
1. A new GAN architecture has been developed which, unlike existing architectures, uses the Self-Attention mechanism in the generator and discriminator, making it possible to improve the quality of synthesized images. The developed architecture supports image synthesis by labels (conditional generation), which the architectures and approaches mentioned above do not provide.
2. A new algorithm for the synthesis of cytological images based on diffusion models has been developed. In the Stable Diffusion environment, an algorithm for synthesizing cytological images was implemented, which made it possible to synthesize a sufficient sample of images for CNN training.
3.
Computer experiments based on the diffusion model in the Stable Diffusion environment were carried out, and the following results were obtained: the value of the FID metric is 0.63 (class 1 – 0.54, class 2 – 0.6, class 3 – 0.7, class 4 – 0.68), and the value of the IS metric is 3.99. Generating based on GAN provided the following results: FID – 3.39 (class 1 – 3.42, class 2 – 3.42, class 3 – 3.35, class 4 – 3.37), IS – 3.95. Consequently, generation based on the diffusion model in the Stable Diffusion environment showed better results compared to generation based on GAN. Therefore, further research will be the development of new diffusion models for generating histological and immunohistochemical images. Declaration on Generative AI The authors have not employed any Generative AI tools. References [1] A. Ramesh, M. Pavlov, G. Goh, Zero-Shot Text-to-Image Generation, arXiv preprint (2021). doi:10.48550/arXiv.2102.12092. [2] M. Chen, J. Tworek, H. Jun, Evaluating Large Language Models Trained on Code, arXiv preprint, (2021). doi:10.48550/arXiv.2107.03374. [3] R. Bommasani, D.A. Hudson, E. Adeli, On the Opportunities and Risks of Foundation Models, arXiv preprint (2021). doi:10.48550/arXiv.2108.07258. [4] R. Thoppilan, D. De Freitas, J. Hall, LaMDA: Language Models for Dialog Applications, arXiv preprint (2022). doi:10.48550/arXiv.2201.08239. [5] F. Bray, M. Laversanne, H. Sung, J. Ferlay, R.L. Siegel, I. Soerjomataram, A. Jemal, Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries, CA Cancer J Clin (2024). doi:10.3322/caac.21834. [6] Y. Xu, J. Liang, Y. Zhuo, L. Liu, Y. Xiao, L. Zhou, TDASD: Generating Medically Significant Fine- Grained Lung Adenocarcinoma Nodule CT Images Based on Stable Diffusion Models with Limited Sample Size, Computer Methods and Programs in Biomedicine 248 (2024) 108103. doi:10.1016/j.cmpb.2024.108103. [7] A. Kazerouni, E. Khodapanah Aghdam, M. Heidari, R. Azad, M. Fayyaz, I. 
Hacihaliloglu, D. Merhof, Diffusion Models in Medical Imaging: A Comprehensive Survey, Medical Image Analysis 88 (2023) 102846. doi:10.1016/j.media.2023.102846.
[8] Y. Song, S. Ermon, Generative Modeling by Estimating Gradients of the Data Distribution, in: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), 2019. doi:10.48550/arXiv.1907.05600.
[9] R. Steed, A. Caliskan, Image Representations Learned With Unsupervised Pre-Training Contain Human-like Biases, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’21), ACM, 2021, pp. 701–713. doi:10.1145/3442188.3445932.
[10] M. Pozzi, S. Noei, E. Robbi, L. Cima, M. Moroni, E. Munari, E. Torresani, G. Jurman, Generating Synthetic Data in Digital Pathology Through Diffusion Models: A Multifaceted Approach to Evaluation, bioRxiv preprint (2023). doi:10.1101/2023.11.21.23298808.
[11] A. Radford, J.W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models From Natural Language Supervision, arXiv preprint (2021). doi:10.48550/arXiv.2103.00020.
[12] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-Resolution Image Synthesis with Latent Diffusion Models, arXiv preprint (2021). doi:10.48550/arXiv.2112.10752.
[13] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S.K. Seyed Ghasemipour, B. Karagol Ayan, S.S. Mahdavi, R. Gontijo Lopes, T. Salimans, J. Ho, D.J. Fleet, M. Norouzi, Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, arXiv preprint (2022). doi:10.48550/arXiv.2205.11487.
[14] P. Azadi Moghadam, S. Van Dalen, K.C. Martin, J. Lennerz, S. Yip, H. Farahani, A. Bashashati, A Morphology Focused Diffusion Probabilistic Model for Synthesis of Histopathology Images, arXiv preprint (2022). doi:10.48550/arXiv.2209.13167.
[15] J. Ho, A. Jain, P. Abbeel, Denoising Diffusion Probabilistic Models, arXiv preprint (2020). doi:10.48550/arXiv.2006.11239.
[16] J. Sohl-Dickstein, E.A.
Weiss, N. Maheswaranathan, S. Ganguli, Deep Unsupervised Learning Using Nonequilibrium Thermodynamics, arXiv preprint (2015). doi:10.48550/arXiv.1503.03585.
[17] P. Dhariwal, A. Nichol, Diffusion Models Beat GANs on Image Synthesis, arXiv preprint (2021). doi:10.48550/arXiv.2105.05233.
[18] J. Ho, C. Saharia, W. Chan, D.J. Fleet, M. Norouzi, T. Salimans, Cascaded Diffusion Models for High Fidelity Image Generation, arXiv preprint (2021). doi:10.48550/arXiv.2106.15282.
[19] T. Karras, S. Laine, T. Aila, A Style-Based Generator Architecture for Generative Adversarial Networks, arXiv preprint (2019). doi:10.48550/arXiv.1812.04948.
[20] A. Vahdat, J. Kautz, NVAE: A Deep Hierarchical Variational Autoencoder, arXiv preprint (2020). doi:10.48550/arXiv.2007.03898.
[21] Y. Song, J. Sohl-Dickstein, D.P. Kingma, A. Kumar, S. Ermon, B. Poole, Score-Based Generative Modeling through Stochastic Differential Equations, arXiv preprint (2021). doi:10.48550/arXiv.2011.13456.
[22] A. Lugmayr, M. Danelljan, A. Romero, F. Yu, R. Timofte, L. Van Gool, RePaint: Inpainting using Denoising Diffusion Probabilistic Models, arXiv preprint (2022). doi:10.48550/arXiv.2201.09865.
[23] A. Nichol, P. Dhariwal, Improved Denoising Diffusion Probabilistic Models, arXiv preprint (2021). doi:10.48550/arXiv.2102.09672.
[24] D.P. Kingma, M. Welling, Auto-Encoding Variational Bayes, arXiv preprint (2013). doi:10.48550/arXiv.1312.6114.
[25] O. Berezsky, P. Liashchynskyi, O. Pitsun, I. Izonin, Synthesis of Convolutional Neural Network architectures for biomedical image classification, Biomedical Signal Processing and Control 95 (2024) 106325. doi:10.1016/j.bspc.2024.106325.
[26] O. Berezsky, P. Liashchynskyi, O. Pitsun, G. Melnyk, Method and Software Tool for Generating Artificial Databases of Biomedical Images Based on Deep Neural Networks, CEUR Workshop Proceedings, 2023, pp. 15–26.
[27] O. Berezsky, O. Pitsun, G. Melnyk, T. Datsko, I. Izonin, B.
Derysh, An Approach toward Automatic Specifics Diagnosis of Breast Cancer Based on an Immunohistochemical Image, Journal of Imaging 9(1) (2023) 12. doi:10.3390/jimaging9010012.
[28] O. Berezsky, O. Pitsun, P. Liashchynskyi, B. Derysh, N. Batryn, Computational Intelligence in Medicine, in: S. Babichev, V. Lytvynenko (eds.), Lecture Notes in Data Engineering, Computational Intelligence, and Decision Making. ISDMCI 2022, volume 149 of Lecture Notes on Data Engineering and Communications Technologies, Springer, Cham, 2023. doi:10.1007/978-3-031-16203-9_28.
[29] O. Berezsky, P. Liashchynskyi, O. Pitsun, M. Berezkyy, Comparison of Deep Neural Network Learning Algorithms for Biomedical Image Processing, IDDM-2022, CEUR Workshop Proceedings, 2022, pp. 135–145.
[30] O. Berezsky, Y. Batko, G. Melnyk, S. Verbovyy, L. Haida, Segmentation of cytological and histological images of breast cancer cells, in: 2015 IEEE 8th International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), 2015, pp. 287–292. doi:10.1109/IDAACS.2015.7340745.
[31] O. Berezsky, S. Verbovyy, T. Datsko, The intelligent system for diagnosing breast cancers based on image analysis, in: 2015 Information Technologies in Innovation Business Conference (ITIB), Kharkiv, Ukraine, 2015, pp. 27–30. doi:10.1109/ITIB.2015.7355067.
[32] The PASCAL Visual Object Classes Homepage, 2014. URL: http://host.robots.ox.ac.uk/pascal/VOC/
[33] Common Objects in Context dataset, 2021. URL: https://cocodataset.org
[34] P. Cuña Cabrera, APCData cervical cytology cells, 2024. URL: https://data.mendeley.com/datasets/ytd568rh3p/1. doi:10.17632/YTD568RH3P.1.
[35] A. Victória Matias, UFSC OCPap: Papanicolaou Stained Oral Cytology Dataset (v4), 2022. URL: https://data.mendeley.com/datasets/dr7ydy9xbk/1. doi:10.17632/DR7YDY9XBK.1.
[36] O. Berezsky, T. Datsko, G. Melnyk, V. Nykoliuk, O. Pitsun, S. Verbovyy, Database of Digital Histological and Cytological Images "ВРСІ2100". Database.
Copyright registration certificate number 75359, December 14, 2017, bulletin No. 47 of January 26, 2018.
[37] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A.C. Courville, Y. Bengio, Generative adversarial networks, Communications of the ACM 63 (2020) 139–144.
[38] S. Barratt, R. Sharma, A Note on the Inception Score, arXiv preprint (2018). doi:10.48550/arXiv.1801.01973.
[39] A. Borji, Pros and cons of GAN evaluation measures, Comput. Vis. Image Underst. 179 (2019) 41–65. doi:10.1016/j.cviu.2018.10.009.
[40] O. Berezsky, T. Datsko, G. Melnyk, Cytological and histological images of breast cancer, Zenodo, 2023. doi:10.5281/zenodo.7890874.
[41] Rudi, 2024. URL: https://github.com/liashchynskyi/rudi.