<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>December</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Simulated Datasets Generator for Testing Data Analytics Methods</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Serhii Toliupa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Pylypenko</string-name>
          <email>anna.pylypenko@knu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleh Tymchuk</string-name>
          <email>oleh.tymchuk@knu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksii Kohut</string-name>
          <email>oleksii_kohut@knu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>64/13 Volodymyrska St, Kyiv, 01601</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>9</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>This article explores the role of benchmark datasets in testing data analysis tools. The use of synthetic data generators is motivated by the need for scaling the size of training datasets, filling data gaps, and safeguarding data confidentiality. Existing research in this field emphasizes the importance of applications in various domains, such as fraud detection, healthcare, and computer graphics. The efficacy of constructing Gaussian distributions using the Box-Muller transform is investigated, while limitations in generating extreme values are highlighted. The integration of specialized distributions is proposed to address this gap and enhance dataset variability for improved performance in data analytics methods. The article presents a synthetic data generator capable of producing datasets for effective evaluation of machine learning methods. Practical tests demonstrate the software's effectiveness in creating test datasets with controlled variations. Four different tests were conducted, each with three different variants: 1) normal distribution parameters, namely standard deviation, 2) class imbalance, 3) missing values, and 4) outliers. The generated datasets were used to conduct controlled tests of logistic regression. Performance evaluation of the logistic regression model employed metrics such as Precision, Accuracy, Type 1 Error, Type 2 Error, and F1-Score.</p>
      </abstract>
      <kwd-group>
        <title>Keywords</title>
        <kwd>Benchmark datasets creation</kwd>
        <kwd>synthetic dataset generator</kwd>
        <kwd>analysis methods</kwd>
        <kwd>extreme values</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The modern world generates an enormous volume of data every day, ranging from sensor data to
textual information. Testing data analytics tools requires realistic and representative datasets
that reflect contemporary challenges in data processing. The relevance of this topic is
also driven by the development of artificial intelligence and machine learning: these algorithms
need data for both training and evaluation. Creating realistic datasets is therefore critically
important for the precise assessment and comparison of various methods. The complexity of this
problem stems from its interdisciplinary nature: analytical tools and methods are used across various
domains, including medicine, finance, biology, ecology, and others. The creation of benchmark datasets
can also help address ethical concerns related to data privacy and security through the development
of anonymized or synthetic data. The creation of benchmark datasets for testing data analytics tools is
thus essential for advancing science and technology in this field and ensuring their practical utility
across diverse domains.</p>
      <p>
        Many published studies provide comprehensive insights into the process
of creating benchmark datasets and its significance. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the authors introduce the concept of
benchmark metrics in machine learning for scientific purposes and review existing approaches. They
underscore that the selection of the most appropriate machine learning algorithm for scientific data
analysis remains a significant challenge because of the wide range of potentially applicable machine
learning frameworks, models, and computer architectures. The research by Krizhevsky,
Sutskever, and Hinton [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] was the first to demonstrate the profound influence of datasets on the
learning outcomes of deep neural networks, which helps in comprehending the significance
of benchmarking. The article [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] by M. Fernandez-Delgado, E. Cernadas, S. Barro and D. Amorim is a
significant contribution to the field of machine learning and data classification, providing practical
insights into the use of classifiers in real-world scenarios. The article explores various classification
algorithms and compares their performance across different benchmark datasets, which can serve as a
valuable resource for discussing the effectiveness of various analytical tools.
      </p>
      <p>2023 Copyright for this paper by its authors. CEUR (ceur-ws.org).</p>
      <p>
        Each data generator program uses a unique approach to data creation. The article [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] presents a data
generator designed to fill in gaps that may exist in other programs. The developed system allows users
to customize and create known statistical distributions to achieve the desired outcome. Additionally, it
offers real-time data behavior visualization to analyze whether they possess the characteristics
necessary for effective testing. In the articles [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], the authors provide an overview of the design and
architecture of the Information Discovery and Analysis Systems (IDAS) Data Set Generator (IDSG),
which enables a fast and comprehensive evaluation of IDAS. IDSG generates data using statistical
algorithms, rule-based algorithms, and semantic graphs that represent interdependencies between
attributes. To illustrate this approach, an application for credit card transactions is used. Srđan Popić et
al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] provided a brief overview of various types of generators in terms of their architecture and
anticipated usage, as well as listed their advantages and disadvantages. They also presented a review of
the data generation algorithms used and best practices in various domains.
      </p>
      <p>
        Researchers in this field have attempted to assess the utility of synthetic data generators using various
evaluation metrics. However, it has been found that these metrics lead to conflicting conclusions,
complicating the direct comparison of synthetic data generators. In their study, Fida Kamal Dankar and
colleagues [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ] identified four criteria for evaluating masked data by categorizing available utility
metrics into different categories based on the information they seek to preserve: attribute fidelity,
two-dimensional parameter fidelity, population fidelity, and application fidelity. In the article [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the
authors have introduced several novel and efficient methods and multidimensional data structures that
can enhance the decision-making process in various domains. They have examined online range
aggregation, range selection, and weighted range median queries; for most of these, data structures and
techniques are presented that can provide answers in near-polynomial time.
      </p>
      <p>
        In practice, obtaining real data can be challenging due to confidentiality issues. Additionally, real
data may not conform to specific characteristics required for evaluating new approaches under certain
conditions. Given these constraints, the use of synthetic data becomes a viable alternative to supplement
real data in various domains. For example, in the article [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the authors described the process of
generating synthetic data using the publicly available tool Benerator to mimic the distribution of
aggregated statistical data obtained from the national population census. The generated datasets
successfully replicated microdata containing records with social, economic, and demographic
information. Forensics also requires testing digital information tools. Thomas Göbel et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
introduced the concept of a structure called hystck for creating synthetic datasets based on the ground
truth. This framework supports automated generation of synthetic network traffic and artifacts of
operating systems and software by simulating human-computer interactions. To preserve
confidentiality, banks are unwilling to share fraud statistics and datasets with the public. To overcome
these limitations, Ikram Ul Haq et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] introduced an innovative technique for generating uniformly
distributed synthetic data (HCRUD) based on highly correlated rules. This technique allows the
generation of synthetic datasets of any size, replicating the characteristics of restricted actual fraud data,
thus supporting further research in fraud detection. Access to medical datasets is also complicated due
to concerns about patient confidentiality. The development of synthetic datasets that are realistic enough
for testing digital applications is considered as a potential alternative that allows their deployment.
Theodoros Arvanitis et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] have devised a method for generating synthetic data statistically
equivalent to real clinical datasets and have demonstrated that the approach based on Generative
Adversarial Networks aligns with this goal. Thus, the concept of creating realistic medical synthetic
datasets has been successfully validated. However, data quality issues exist both in real and synthetic
data, with the latter reflecting real-world problems and artifacts created by synthetic datasets. The
intellectual analysis of synthetic healthcare data represents a novel field with its unique challenges.
According to Alistair Bullward et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], researchers should be aware of the risks associated with
extrapolating results from synthetic data studies to real-world scenarios and should evaluate outcomes
using analysts who can review the underlying data. Synthetic data produced with computer
graphics is frequently used for training computer vision models, as mentioned in [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ]. In many
industrial computer vision tasks, deep learning methods such as convolutional neural networks have
been successfully employed, as indicated by the works [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ]. In recent years, generative adversarial
networks (GANs) have been effectively utilized for generating new realistic images and manipulating
them, as noted in the research by I. H. Rather and S. Kumar [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>Therefore, the increasing availability and utilization of data analytics tools make the standardization
of benchmark datasets an essential task for their further adoption across various fields. The challenge
of developing more objective metrics and methods for assessing the utility of synthetic data remains
unresolved. One of these problems is adequate tail-end modeling of probability density functions.
Extreme values can significantly impact risks and outcomes in several sectors such as finance,
insurance, climatology, engineering, among others. This article presents research in the field of creating
synthetic data that aligns with real-world requirements and enables their effective use for testing various
analytical tools. The primary focus is on comprehending and analyzing characteristics of this generator,
especially in the tail areas of the Gaussian probability density function.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Mathematical methods</title>
      <p>
        The generation of Gaussian distributions is foundational in various scientific and computational
fields, playing a pivotal role in modeling natural phenomena and simulating random variables. To create
a Gaussian distribution, it is suggested to use the Box-Muller transform [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. It produces a pair of
Gaussian random numbers using a pair of uniform numbers. The fundamental principle of the
Box-Muller algorithm lies in its ability to generate pairs of independent standard Gaussian random variables
from uniformly distributed random numbers. This method leverages the polar coordinate representation
in a two-dimensional space to transform pairs of independent uniformly distributed variables into pairs
of normally distributed variables. By employing trigonometric functions and geometric interpretations,
this algorithm constructs Gaussian values from the magnitude and angle derived from uniformly
generated random variables. The algorithm to generate the Gaussian samples is as follows:
      </p>
      <p>1. Generate samples <italic>u</italic><sub>1</sub> and <italic>u</italic><sub>2</sub> using two distinct uniform random number generators.</p>
      <p>2. Apply the inverse cumulative distribution function (CDF) of the exponential distribution (with parameter equal to 1) to <italic>u</italic><sub>1</sub>:
        <disp-formula><label>(1)</label><tex-math><![CDATA[r = \sqrt{-2\ln(1 - u_1)} = \sqrt{-2\ln(u_1)},]]></tex-math></disp-formula>
where <italic>r</italic> is the distance from the origin for each sample. For simplicity, <italic>u</italic><sub>1</sub> is replaced by 1 − <italic>u</italic><sub>1</sub>, since they are both uniform samples on (0, 1).</p>
      <p>3. Apply the inverse CDF of the uniform distribution on (0, 2π) to <italic>u</italic><sub>2</sub>:
        <disp-formula><label>(2)</label><tex-math><![CDATA[\theta = 2\pi u_2,]]></tex-math></disp-formula>
where <italic>θ</italic> is the angle of the sample.</p>
      <p>4. Finally, determine <italic>x</italic> and <italic>y</italic> using basic trigonometric calculations:
        <disp-formula><label>(3)</label><tex-math><![CDATA[x = r\cos(\theta), \quad y = r\sin(\theta).]]></tex-math></disp-formula>
      </p>
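      <p>Steps 1-4 above can be illustrated with a minimal Python sketch. It is not the generator's own implementation (which is written in C#); the function names <italic>box_muller</italic> and <italic>gaussian_samples</italic> are hypothetical, and a single seeded generator is assumed in place of two distinct uniform generators.</p>
      <preformat>
```python
import math
import random


def box_muller(rng):
    """Return one pair (x, y) of independent standard normal samples."""
    u1 = rng.random()  # step 1: uniform on [0, 1)
    u2 = rng.random()
    # Step 2: radius via the inverse exponential CDF; 1 - u1 lies in (0, 1],
    # so the logarithm is always defined.
    r = math.sqrt(-2.0 * math.log(1.0 - u1))
    # Step 3: angle, uniform on [0, 2*pi).
    theta = 2.0 * math.pi * u2
    # Step 4: convert polar coordinates to Cartesian ones.
    return r * math.cos(theta), r * math.sin(theta)


def gaussian_samples(n_pairs, seed=0):
    """Collect both values of every generated pair into one flat list."""
    rng = random.Random(seed)  # assumption: one seeded generator for u1 and u2
    samples = []
    for _ in range(n_pairs):
        samples.extend(box_muller(rng))
    return samples


if __name__ == "__main__":
    data = gaussian_samples(50_000)
    mean = sum(data) / len(data)
    std = math.sqrt(sum((v - mean) ** 2 for v in data) / len(data))
    print(f"mean={mean:.3f}, std={std:.3f}")
```
      </preformat>
      <p>With a large number of pairs, the sample mean and standard deviation should be close to 0 and 1, which gives a quick sanity check of the transform.</p>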
      <p>The Box-Muller algorithm generates two independent random numbers upon each execution. Each
pair of generated numbers represents two independent random variables following a standard normal
distribution. These variables possess a mean of 0 and a standard deviation of 1. The produced values
can be utilized as required for various statistical or computational purposes. For instance, both numbers
from each pair can be employed to generate pairs of normally distributed random variables.
Alternatively, a single number from each pair might suffice if the need is for a singular normally
distributed value. By combining both sets of generated values into a single dataset, a larger and
potentially more diverse sample can be constructed, enriching the dataset used for testing machine
learning models. This integration allows for a broader spectrum of data points, enhancing the robustness
of the evaluation process and potentially fortifying the model's generalization capabilities.</p>
      <p>
        The Box-Muller algorithm, while efficient in generating core values from the standard normal
distribution, may necessitate additional methods to generate values from the tails of the distribution
[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Extreme or outlier values that reside in the tails of the distribution are often critical for assessing
rare events or evaluating the performance of algorithms. The generation of extreme values in random
numbers can be achieved using specialized distributions. Some distributions, such as the exponential,
Weibull, and Fréchet extreme value distributions, among others, directly model extreme values. The
proposed algorithm incorporates the following steps to generate such values.
      </p>
      <p>Continuation of the algorithm to generate the Gaussian samples:</p>
      <p>5. Set the shape parameter <italic>c</italic>, which governs the tail behavior: it determines
the shape and heaviness of the distribution's tails.</p>
      <p><italic>c</italic> &gt; 0: indicates a distribution with bounded tails; the probability of encountering
extreme values decreases more rapidly than in a normal distribution.</p>
      <p><italic>c</italic> = 0: corresponds to the exponential distribution, where the tails are light, and the probability of
extreme values decreases exponentially.</p>
      <p><italic>c</italic> &lt; 0: indicates a distribution with heavier tails than the exponential distribution; the
probability of encountering extreme values decreases more slowly than in an exponential distribution.</p>
      <p>6. Establish the probability density function of the distribution, given by the formula
        <disp-formula><label>(4)</label><tex-math><![CDATA[f(x; c) = \begin{cases} e^{-[1 + cx]^{-1/c}} \cdot (1 + cx)^{-1/c - 1} & \text{for } c \neq 0, \\ e^{-e^{-x}} \cdot e^{-x} & \text{for } c = 0, \end{cases}]]></tex-math></disp-formula>
and use the inverse transform method to generate extreme values:
        <disp-formula><label>(5)</label><tex-math><![CDATA[x = \begin{cases} \frac{1}{c}\left( (-\ln(1 - u))^{-c} - 1 \right) & \text{for } c \neq 0, \\ -\ln(-\ln(1 - u)) & \text{for } c = 0, \end{cases}]]></tex-math></disp-formula>
where <italic>u</italic> is a random variable from a uniform distribution on the interval (0, 1) and can be obtained
as a fraction of random variables generated in step 1. Validate the generated extreme values to ensure
they align with the expected tail behavior.</p>
      <p>7. After obtaining the standard normal random variables and the extreme values, which together form
the sample <italic>x</italic>, one can move on to a normally distributed value with mathematical expectation <italic>μ</italic>
and standard deviation <italic>σ</italic>:</p>
      <p><disp-formula><label>(6)</label><tex-math><![CDATA[z = \mu + \sigma x.]]></tex-math></disp-formula></p>
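      <p>The inverse transform of step 6 and the rescaling of step 7 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the shape parameter is called <italic>c</italic> as in step 5, and the sign convention is taken exactly from the formula in step 6.</p>
      <preformat>
```python
import math
import random


def extreme_value(u, c):
    """Inverse transform of step 6 for a uniform sample u in (0, 1)."""
    w = -math.log(1.0 - u)  # strictly positive for every u in (0, 1)
    if c == 0:
        return -math.log(w)
    return (w ** (-c) - 1.0) / c


def scale(x, mu, sigma):
    """Step 7: shift by the expectation mu and scale by sigma."""
    return mu + sigma * x


def extreme_samples(n, c, mu=0.0, sigma=1.0, seed=0):
    """Draw n rescaled extreme values with shape parameter c."""
    rng = random.Random(seed)
    out = []
    while len(out) != n:
        u = rng.random()
        if u > 0.0:  # skip the endpoint u = 0, where the formula is undefined
            out.append(scale(extreme_value(u, c), mu, sigma))
    return out


if __name__ == "__main__":
    for shape in (-0.2, 0.0, 0.2):
        values = extreme_samples(10_000, shape)
        print(shape, round(min(values), 3), round(max(values), 3))
```
      </preformat>
      <p>Printing the minimum and maximum for several values of the shape parameter is one simple way to validate that the generated values show the expected tail behavior.</p>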
      <p>
        Depending on the practical task there might be a need to shift the distribution along the axis of values
or change its scale. In such cases, it would be advantageous to employ the Generalized Extreme Value
(GEV) distributions, which combine the Gumbel, Fréchet, and Weibull families, also known as type I,
II, and III extreme value distributions [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. These distributions offer flexibility in adjusting the
positioning and scaling of the distribution to accommodate various scenarios and analyses.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Development of synthetic dataset generator</title>
      <p>In the domain of data generation for machine learning research, a spectrum of tools and libraries,
including MOSTLY.AI (https://mostly.ai/), Mockaroo (https://www.mockaroo.com/), and Scikit-learn
(https://scikit-learn.org/stable/datasets/sample_generators.html), among others, is at researchers'
disposal. These tools furnish a fundamental framework for the generation of synthetic datasets, enabling
users to craft data with predetermined statistical distributions. Nevertheless, they exhibit inherent
limitations. For instance, scikit-learn is primarily constrained by its capacity to generate synthetic
datasets featuring a limited array of distributions and parameters, rendering it less suitable for the
creation of intricate or verisimilar data representations. These tools primarily target numerical data and
lack the specialized capabilities required for the synthesis of structured data types, such as textual
information, categorical attributes, or time series. In the subsequent sections of this article, the
utilization of synthetic data will be examined primarily in the context of testing machine learning
methods, considering the noted shortcomings of existing tools.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. Data model</title>
      <p>The specific capabilities of a synthetic data generator may vary from one implementation to another. However,
such software should have the following main features:
        <list list-type="bullet">
          <list-item><p>defining the name, description, and additional information associated with the model;</p></list-item>
          <list-item><p>describing all dimensions of the model, including their data type, name, description, and most
importantly, the algorithm for generating the value;</p></list-item>
          <list-item><p>exporting synthetic data rows to a file or database based on the created model.</p></list-item>
        </list>
      </p>
      <p>
        Each data model must have at least one dimension. Each dimension should be defined by the
following characteristics: dimension name; data type (integer or real number, category, string);
expression/formula that defines how the data will be filled in; additional options (presence of blank
values/outliers). Values in columns can be either independent expressions or calculated based on values
in other columns (while avoiding cyclic dependencies, where, for instance, expression A relies on the
value from column B, and expression B relies on the value from column A). The data model serves as
an abstraction of a dataset, comprising specifications that characterize the behavior of the data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Figure 1 depicts an example of a rudimentary data model that characterizes specific
individuals for the purpose of analyzing patient ages. There are three dimensions:
      </p>
      <p>1) Name: an informative dimension, intended more for identifying the row than for analyzing
the data. Such strings are not useful in machine learning, and with real data they would raise a
privacy issue. Because this string is generated randomly, there is no concern about using
personal data belonging to individuals. Since this dimension is defined
without modifiers, its value is always present.</p>
      <p>2) Age: this dimension determines the patient's age. Its expression defines a random value drawn from a normal
distribution with a mean of 14.0 and a variance of 3.0. There are no blank fields.</p>
      <p>3) IsAdult: this dimension determines whether the patient is an adult. This is the only dimension that
uses another dimension in its calculations. If the patient is over 18, this field is set to 1, otherwise 0.
There are no blank fields.</p>
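      <p>The three-dimension model above, together with the rule that expressions must not depend on each other cyclically, can be sketched in Python. The dictionary layout, the names <italic>evaluation_order</italic>, <italic>generate_rows</italic> and <italic>PATIENT_MODEL</italic>, and the six-letter random name are illustrative assumptions, not the generator's actual data format.</p>
      <preformat>
```python
import random
import string


def evaluation_order(model):
    """Order dimensions so that each one comes after its dependencies."""
    order, state = [], {}

    def visit(name):
        if state.get(name) == "done":
            return
        if state.get(name) == "visiting":
            raise ValueError(f"cyclic dependency involving '{name}'")
        state[name] = "visiting"
        for dep in model[name]["depends_on"]:
            visit(dep)
        state[name] = "done"
        order.append(name)

    for name in model:
        visit(name)
    return order


def generate_rows(model, n_rows, seed=0):
    """Evaluate every dimension of the model for each generated row."""
    rng = random.Random(seed)
    order = evaluation_order(model)
    rows = []
    for _ in range(n_rows):
        row = {}
        for name in order:
            row[name] = model[name]["expr"](rng, row)
        rows.append(row)
    return rows


PATIENT_MODEL = {
    "Name": {
        "depends_on": [],
        "expr": lambda rng, row: "".join(rng.choices(string.ascii_uppercase, k=6)),
    },
    "Age": {
        "depends_on": [],
        # Assumption: the value 3.0 from the text is used as the spread
        # parameter of the normal distribution.
        "expr": lambda rng, row: rng.gauss(14.0, 3.0),
    },
    "IsAdult": {
        "depends_on": ["Age"],
        "expr": lambda rng, row: 1 if row["Age"] > 18 else 0,
    },
}

if __name__ == "__main__":
    for row in generate_rows(PATIENT_MODEL, 3):
        print(row)
```
      </preformat>
      <p>Resolving the evaluation order once, before any rows are generated, means a cyclic model is rejected up front instead of failing in the middle of an export.</p>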
    </sec>
    <sec id="sec-5">
      <title>3.2. Modelling of the generator</title>
      <p>Editing a data model includes editing general information about the model, such as the name,
description, and availability to other users. Most importantly, in this mode, you can edit each dimension
separately, with a real-time check for the correctness of the entered data. Dimensions can be added or
deleted, but each model will always have at least 1 dimension with a line number. Each dimension must
be given a unique name that will not match other dimensions within the model. All dimensions must be
one of the defined data types (see Table 1).</p>
      <p>The lists of operands and functions available for generation are described in Tables 2-4.</p>
    </sec>
    <sec id="sec-6">
      <title>3.3. Usage scenarios: generating test datasets</title>
      <p>To verify the usefulness of this software, practical tests were performed with it.
A data model was specified in the generator and multiple tweaks were applied to it. The dataset was
created using the proposed algorithm to generate the Gaussian samples (1)–(6). The default model is
described as follows:
        <list list-type="bullet">
          <list-item><p>two classes;</p></list-item>
          <list-item><p>two dimensions: one dimension for the class, another for the value;</p></list-item>
          <list-item><p>the value is generated depending on the class with the help of the ternary operator and the normal distribution;</p></list-item>
          <list-item><p>no outliers, no blank values.</p></list-item>
        </list>
      </p>
      <p>The dataset was specified on the data model tab. For each test a CSV file with 1000 entries was
generated. Then, with the help of Python, graphs for this data were drawn. This provides a visual clue
about how the change affects the data. A total of four different tests were performed, each with three
different variations.</p>
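      <p>A dataset of this shape can be approximated outside the generator with a short Python sketch. The class means 10.0 and 20.0, the column names, and the function names are assumptions made for illustration; only the two-class structure, the ternary choice, the normal distribution, and the 1000-row CSV output come from the description above.</p>
      <preformat>
```python
import csv
import io
import random


def default_model_rows(n_rows=1000, sigma=5.0, class1_weight=0.5, seed=0):
    """Two-class dataset whose value is drawn from a class-dependent normal."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        label = "class1" if class1_weight > rng.random() else "class2"
        # Ternary choice of the mean; 10.0 and 20.0 are illustrative values.
        mean = 10.0 if label == "class1" else 20.0
        rows.append({"class": label, "value": rng.gauss(mean, sigma)})
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["class", "value"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    text = to_csv(default_model_rows(sigma=3.0, class1_weight=0.4))
    print("\n".join(text.splitlines()[:5]))
```
      </preformat>
      <p>Changing <italic>sigma</italic> or <italic>class1_weight</italic> while keeping the seed fixed produces controlled variants of the same dataset.</p>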
    </sec>
    <sec id="sec-7">
      <title>3.3.1. Changing parameters of normal distribution</title>
      <p>In this test, the standard deviation of the normal distribution for both classes was changed. The values
chosen are 5.0, 3.0, and 1.0. The expected result of such a change is that values become less dispersed
across the axis. The result of the software for this case is presented in Figures 2-3.</p>
      <p>For the current test, a swarm plot was used. The color represents the class of the item, and its position
on the Y axis represents the value. With each successive plot the values become more tightly packed around the central
value, confirming that the standard deviation is indeed changing.</p>
    </sec>
    <sec id="sec-8">
      <title>3.3.2. Changing class distribution</title>
      <p>Here, a weighted category was introduced in place of the default category. For class 1, the probability
of appearance decreases each time, and for class 2 it increases, in steps of 10 percentage points. So,
the first distribution of classes is 50-50%, then 40-60%, then 30-70% (Figures 4-5).</p>
      <p>As can be seen in Figure 5, the number of class2 entries grows with each successive plot,
confirming that this feature works.</p>
    </sec>
    <sec id="sec-9">
      <title>3.3.3. Adding outliers</title>
      <p>To use outliers, an additional dimension was added. A random value is drawn from a
uniform distribution, and if it is less than a certain threshold (which equals the probability of such an event),
the value is multiplied by 5 (Figures 6-7). Otherwise, the expected value is used.</p>
      <p>As the probability of outliers increases, so does their number.
It is also worth noting that class2 outliers reach higher values than class1 outliers because of the bigger base value.
This creates additional separation between class1 and class2 in the higher values.</p>
    </sec>
    <sec id="sec-10">
      <title>3.3.4. Setting up missing values</title>
      <p>Missing values can be introduced into the dataset in a similar manner to outliers, with the help of
an additional dimension. The only change is that the missing() function is used instead of multiplication
(Figures 8-9).</p>
      <p>For this test, an event plot was used instead of a swarm plot. Colored lines represent the generated objects
(from 1 to 1000) with the corresponding class. Each black line represents a missing value.</p>
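      <p>The outlier and missing-value mechanisms of Sections 3.3.3 and 3.3.4 follow the same pattern: a uniform draw is compared with a threshold. A minimal Python sketch is given below; the helper names are hypothetical, and <italic>None</italic> is used as a stand-in for the generator's missing() function.</p>
      <preformat>
```python
import random


def add_outliers(values, probability, factor=5.0, seed=0):
    """Multiply a value by `factor` when a uniform draw is below `probability`."""
    rng = random.Random(seed)
    return [v * factor if probability > rng.random() else v for v in values]


def add_missing(values, probability, seed=0):
    """Replace a value with None when a uniform draw is below `probability`."""
    rng = random.Random(seed)
    return [None if probability > rng.random() else v for v in values]


if __name__ == "__main__":
    base = [10.0] * 20
    print(add_outliers(base, 0.1))
    print(add_missing(base, 0.2))
```
      </preformat>
      <p>Because the threshold equals the event probability, the expected share of modified rows equals that probability.</p>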
    </sec>
    <sec id="sec-11">
      <title>3.4. Software description</title>
    </sec>
    <sec id="sec-12">
      <title>3.4.1. Module description</title>
      <p>The software implementation of this synthetic dataset generator (lexData) consists of three main
parts: a tokenizer, a parser, and a calculator. The input of this system is a text expression describing the formula
for the dimension value. Figure 10 depicts how the text input is processed into a result.</p>
      <p>Even this simple example involves multiple steps. In the first step, the input string is split into
multiple tokens. A token is an atomic part of the mathematical formula. A single plus sign is a token,
but a whole number or variable name is also a token, since a number cannot simply be divided into digits.
After that, the sequence of tokens is handled by the parser, which builds a complete function. After
concrete values are passed for the parameters specified in the function (which may be other dimensions), the
final value can be calculated and returned as the result. There are further details, such as
verifying that a referenced function exists and that there is no circular dependency. The output of the calculation is
a single number or category, depending on the expression specified.</p>
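      <p>The tokenizer, parser, and calculator stages described above can be illustrated with a small Python sketch for arithmetic expressions. It is not the lexData or lexCalculator code: all names are hypothetical, and only numbers, variable names, parentheses, and the four basic operators are supported.</p>
      <preformat>
```python
import operator
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|([-+*/()]))")


def tokenize(text):
    """Split the input into (kind, value) tokens: number, name or operator."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos != len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        number, name, op = match.groups()
        if number is not None:
            tokens.append(("num", float(number)))
        elif name is not None:
            tokens.append(("name", name))
        else:
            tokens.append(("op", op))
        pos = match.end()
    return tokens


def parse(tokens):
    """Build a function env -> value from the token sequence."""
    pos = 0

    def peek():
        return tokens[pos] if len(tokens) > pos else ("end", None)

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def atom():
        kind, value = advance()
        if kind == "num":
            return lambda env: value
        if kind == "name":
            return lambda env: env[value]
        if (kind, value) == ("op", "("):
            inner = expression()
            if advance() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return inner
        if (kind, value) == ("op", "-"):
            operand = atom()
            return lambda env: -operand(env)
        raise SyntaxError(f"unexpected token {value!r}")

    def binary(operand_parser, operators):
        left = operand_parser()
        while peek()[0] == "op" and peek()[1] in operators:
            func = operators[advance()[1]]
            right = operand_parser()
            # Bind the current operands so later iterations do not overwrite them.
            left = (lambda l, r, f: lambda env: f(l(env), r(env)))(left, right, func)
        return left

    def term():
        return binary(atom, {"*": operator.mul, "/": operator.truediv})

    def expression():
        return binary(term, {"+": operator.add, "-": operator.sub})

    result = expression()
    if peek()[0] != "end":
        raise SyntaxError("unexpected trailing input")
    return result


def calculate(text, **env):
    """Tokenize, parse and evaluate an expression with the given variables."""
    return parse(tokenize(text))(env)


if __name__ == "__main__":
    print(calculate("2 + 3 * x", x=4))
```
      </preformat>
      <p>Parsing once into a function and then calling it with different variable values keeps the per-row cost low when the same expression is evaluated for every generated row.</p>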
    </sec>
    <sec id="sec-13">
      <title>3.4.2. Constraints and decisions</title>
      <p>The generator software was developed using the .NET Framework and the C# programming language.
The generator core is implemented on top of a custom-created library for parsing and evaluating
math expressions, called lexCalculator. To integrate it with this software, it was modified to support
the following features:
        <list list-type="bullet">
          <list-item><p>generating random distributions;</p></list-item>
          <list-item><p>string data types;</p></list-item>
          <list-item><p>logical expressions;</p></list-item>
          <list-item><p>complex data types and lists.</p></list-item>
        </list>
      </p>
      <p>Unit testing of this software was performed with NUnit. Most of the possible test cases
were checked, and the tests provided more than 90% code coverage.</p>
      <p>The software was tested on a machine with the following specifications:
        <list list-type="bullet">
          <list-item><p>Windows 10 Pro;</p></list-item>
          <list-item><p>Intel(R) Core(TM) i5-8350U CPU;</p></list-item>
          <list-item><p>16GB RAM;</p></list-item>
          <list-item><p>NVMe SK hynix 256GB SSD.</p></list-item>
        </list>
      </p>
      <p>On this machine, the generator's throughput was measured on different datasets.
Table 5 reports how many rows were generated per second on average for each specified
dataset. Each dimension is a simple switch between classes with a normal distribution calculation.</p>
      <p>As can be seen in Table 5, the number of classes does not affect generation performance much,
except for the parsing stage, where more classes mean more possibilities to handle. The number of
dimensions affects performance directly, although the effect also depends on the expressions specified
for these dimensions.</p>
    </sec>
    <sec id="sec-14">
      <title>4. Discussion</title>
      <p>Generators of synthetic datasets represent a potent tool for conducting controlled experiments and
investigating the performance of machine learning methods in various scenarios. They facilitate an
enhanced understanding of the capabilities and limitations of these methods. For instance, in this study,
the discussed synthetic datasets with known characteristics and distributions were utilized for the
purpose of conducting controlled tests of logistic regression, as illustrated in Table 6. These benchmark
datasets can encompass diverse data complexities, such as class imbalance, missing values, and outliers,
thereby aiding in assessing how effectively logistic regression operates under different conditions and
whether additional tuning is necessary. To assess the effectiveness of the logistic regression model, the
following metrics were employed:</p>
      <p>1. Precision is the ability of the classifier not to label as positive a sample that is negative. In Table 6,
this metric provides an assessment of the model's overall precision, considering the weights of each
class based on their distribution in the dataset. This allows for accounting for class imbalance, where
one class may have significantly more instances than others.</p>
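      <p>2. Accuracy. This metric indicates the accuracy of a classification model, measuring the overall
percentage of correct predictions (both positive and negative) out of the total number of examples in the
dataset:
        <disp-formula><label>(7)</label><tex-math><![CDATA[\mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}} - 1} 1(\hat{y}_i = y_i),]]></tex-math></disp-formula>
where <italic>ŷ</italic><sub><italic>i</italic></sub> is the predicted value of the <italic>i</italic>-th sample, <italic>y</italic><sub><italic>i</italic></sub> is the corresponding true value, and 1(<italic>x</italic>) is the
indicator function.</p>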
      <p>3. Type 1 Error. As used in this study, this metric is the precision of the model for the positive
class (denoted "class1"): the share of correct positive predictions among all positive predictions made
by the model:
$$\mathrm{Type1\_Error} = \frac{tp}{tp + fp}, \tag{8}$$
where $tp$ (true positive) is a correct result and $fp$ (false positive) is an unexpected result.</p>
      <p>4. Type 2 Error. In binary classification, this metric represents the proportion of false negative
predictions made by the model. It quantifies the rate at which the model incorrectly predicts negative
outcomes and thus indicates its ability to avoid missing positive cases:
$$\mathrm{Type2\_Error} = \frac{fn}{tp + fn}, \tag{9}$$
where $tp$ (true positive) is a correct result and $fn$ (false negative) is a missing result.</p>
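      <p>The two class-level ratios above can be sketched in Python from the confusion counts of a binary
classifier. The sketch assumes that the Type 1 metric is tp / (tp + fp) and that the Type 2 metric is
fn / (tp + fn), following the verbal definitions in the text; the function names and the sample labels
are hypothetical.</p>
      <preformat>
```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for the given positive class label."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def positive_class_precision(tp, fp):
    """Correct positive predictions among all positive predictions."""
    return tp / (tp + fp) if tp + fp else 0.0


def false_negative_rate(tp, fn):
    """Missed positives among all truly positive samples (assumed numerator: fn)."""
    return fn / (tp + fn) if tp + fn else 0.0


# Hypothetical labels giving tp=3, fp=1, fn=1, tn=3.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(positive_class_precision(tp, fp), false_negative_rate(tp, fn))
```
      </preformat>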
      <p>5. F1-Score. This metric combines Precision and Recall and is used to assess the balance between
them. It is particularly useful under class imbalance (different numbers of instances for different
classes) because it considers both Precision and Recall for each class and calculates their weighted
harmonic mean:
$$\mathrm{F1\_score} = 2 \cdot \frac{P(y, \hat{y}) \times R(y, \hat{y})}{P(y, \hat{y}) + R(y, \hat{y})}, \tag{10}$$
where $P(y, \hat{y})$ is precision and $R(y, \hat{y})$ is recall.</p>
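      <p>As a brief illustration of why the harmonic mean is used, the following Python sketch computes the
F1-Score from a precision and a recall value and compares it with their arithmetic mean. The function
name f1_score and the input values are assumptions chosen for the example; returning 0 when both inputs
are 0 is a convention adopted here to avoid division by zero.</p>
      <preformat>
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention assumed here: undefined case mapped to 0
    return 2 * precision * recall / (precision + recall)


# Hypothetical classifier with perfect recall but only 50% precision.
p, r = 0.5, 1.0
print("arithmetic mean:", (p + r) / 2)
print("F1-Score:", f1_score(p, r))
```
      </preformat>
      <p>The F1-Score is lower than the arithmetic mean whenever precision and recall differ, so a model
cannot obtain a high score by maximizing only one of the two.</p>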
      <p>Leveraging synthetic dataset generators enables rapid iteration and testing of various hypotheses and
model parameters without the need to wait for real data.</p>
    </sec>
    <sec id="sec-15">
      <title>5. Conclusion</title>
      <p>The article explores the construction of Gaussian distributions using the Box-Muller transform, a
method relying on uniform random numbers to generate pairs of independent Gaussian variables.
However, while efficient for core Gaussian values, the Box-Muller algorithm may fall short in
generating extreme or outlier values crucial for evaluating rare events. To address this limitation, the
article proposes incorporating specialized distributions to generate extreme values in tails. By
combining these extreme values with standard normal random variables, a broader dataset can be
formed, enriching evaluations and bolstering machine learning models.</p>
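      <p>The idea of combining Box-Muller samples with separately generated tail values can be sketched in
Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the
described generator: the tail model (an exponential excess beyond a fixed threshold with a random sign)
and the names box_muller, sample_with_tails, tail_fraction, threshold, and scale are hypothetical choices
made only for this example.</p>
      <preformat>
```python
import math
import random


def box_muller(rng):
    """Turn two uniform random numbers into two independent standard normal values."""
    u1 = 1.0 - rng.random()  # lies in (0, 1], so log(u1) is always defined
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def sample_with_tails(n, tail_fraction=0.02, threshold=3.0, scale=1.0, seed=0):
    """Mix standard normal values with explicitly generated extreme values."""
    rng = random.Random(seed)
    values = []
    while n > len(values):
        if tail_fraction > rng.random():
            # Assumed tail model: exponential excess beyond the threshold, random sign.
            excess = rng.expovariate(1.0 / scale)
            sign = 1.0 if rng.random() >= 0.5 else -1.0
            values.append(sign * (threshold + excess))
        else:
            z0, z1 = box_muller(rng)
            values.append(z0)
            if n > len(values):
                values.append(z1)
    return values


data = sample_with_tails(10000, tail_fraction=0.02, seed=42)
extreme = sum(1 for v in data if abs(v) >= 3.0)
print(len(data), "samples,", extreme, "with absolute value of at least 3")
```
      </preformat>
      <p>Setting tail_fraction to 0 reduces the sketch to plain Box-Muller sampling, which makes it easy to
compare datasets with and without the injected extreme values.</p>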
      <p>This study also introduces a synthetic data generator designed for evaluating data visualization
methods and machine learning systems. The application is highly adaptable: users can create and store
models, generate artificial data, and explore models created by other users. This accessibility promotes
collaborative learning and testing of machine learning models. The article also demonstrates the
practical utility of the application for assessing machine learning algorithms by generating various
datasets with precise control over typical challenges encountered in machine learning tasks. The visual
representations created for each scenario support the tool's reliability in validating diverse
situations.</p>
      <p>
        Future research could explore machine learning techniques that make synthetic data more realistic by
introducing noise patterns commonly observed in real-world data while preserving the underlying
distribution. Another avenue of investigation is the generation of synthetic data with categorical,
time-series, or mixed data types. This would enable the use of the generated synthetic data within the
context of the
Computing with Words Model [
        <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
        ] and other fuzzy set models.
      </p>
      <p>In conclusion, the creation of benchmark datasets for testing data analytics tools represents a crucial
step in the advancement of data science and machine learning research, offering a standardized means
of evaluating and comparing the performance of various analytical methodologies. These benchmark
datasets not only facilitate fair and rigorous assessment of data analytics tools but also open avenues
for future research in the refinement of synthetic data generation techniques and the development of
more comprehensive and realistic benchmark datasets.</p>
    </sec>
    <sec id="sec-16">
      <title>6. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Thiyagalingam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shankar</surname>
          </string-name>
          , G. Fox, and T. Hey, “
          <article-title>Scientific machine learning benchmarks</article-title>
          ,”
          <source>Nature Reviews Physics</source>
          , vol.
          <volume>4</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>413</fpage>
          -
          <lpage>420</lpage>
          , Apr.
          <year>2022</year>
          , doi: https://doi.org/10.1038/s42254-022-00441-7.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          , “
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          ,”
          <source>Communications of the ACM</source>
          , vol.
          <volume>60</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>84</fpage>
          -
          <lpage>90</lpage>
          , May
          <year>2017</year>
          , doi: https://doi.org/10.1145/3065386.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fernández-Delgado</surname>
          </string-name>
          , E. Cernadas, S. Barro, and D. Amorim, “
          <article-title>Do we need hundreds of classifiers to solve real world classification problems?</article-title>
          ”
          <source>Journal of Machine Learning Research</source>
          , Jan.
          <year>2014</year>
          , doi: https://doi.org/10.5555/2627435.2697065.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Qayyum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qadir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bilal</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Fuqaha</surname>
          </string-name>
          , “
          <article-title>Secure and Robust Machine Learning for Healthcare: A Survey</article-title>
          ,”
          <source>IEEE Reviews in Biomedical Engineering</source>
          , vol.
          <volume>14</volume>
          , pp.
          <fpage>156</fpage>
          -
          <lpage>180</lpage>
          ,
          <year>2021</year>
          , doi: https://doi.org/10.1109/rbme.2020.3013489.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mendonca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gustavo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Santos</surname>
          </string-name>
          , and T. Araujo, “
          <article-title>Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools</article-title>
          ,”
          <source>IEEE Access</source>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>82917</fpage>
          -
          <lpage>82928</lpage>
          , Jan.
          <year>2020</year>
          , doi: https://doi.org/10.1109/access.2020.2991949.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Jeske</surname>
          </string-name>
          et al.,
          “
          <article-title>Generation of synthetic data sets for evaluating the accuracy of knowledge discovery systems</article-title>
          ,”
          <source>Knowledge Discovery and Data Mining</source>
          , Aug.
          <year>2005</year>
          , doi: https://doi.org/10.1145/1081870.1081969.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Lin</surname>
          </string-name>
          et al.,
          “
          <article-title>Development of a Synthetic Data Set Generator for Building and Testing Information Discovery Systems</article-title>
          ,” IEEE Xplore, Apr. 01,
          <year>2006</year>
          . https://ieeexplore.ieee.org/abstract/document/1611688.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Popic</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pavkovic</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velikic</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Teslic</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Data generators: a short survey of techniques and use cases with focus on testing</article-title>
          .
          <source>2019 IEEE 9th International Conference on Consumer Electronics (ICCE-Berlin)</source>
          . https://doi.org/10.1109/ICCEBERLIN47944.2019.8966202.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F. K.</given-names>
            <surname>Dankar</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ibrahim</surname>
          </string-name>
          , “
          <article-title>Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation</article-title>
          ,”
          <source>Applied Sciences</source>
          , vol.
          <volume>11</volume>
          , no.
          <issue>5</issue>
          , p.
          <fpage>2158</fpage>
          , Feb.
          <year>2021</year>
          , doi: https://doi.org/10.3390/app11052158.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F. K.</given-names>
            <surname>Dankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Ibrahim</surname>
          </string-name>
          , and L. Ismail, “
          <article-title>A Multi-Dimensional Evaluation of Synthetic Data Generators</article-title>
          ,”
          <source>IEEE Access</source>
          , vol.
          <volume>10</volume>
          , pp.
          <fpage>11147</fpage>
          -
          <lpage>11158</lpage>
          ,
          <year>2022</year>
          , doi: https://doi.org/10.1109/access.2022.3144765.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Madalina</given-names>
            <surname>Andreica</surname>
          </string-name>
          , Mugurel Ionut Andreica,
          <string-name>
            <given-names>Nicolae</given-names>
            <surname>Cataniciu</surname>
          </string-name>
          .
          <article-title>Multidimensional Data Structures and Techniques for Efficient Decision Making</article-title>
          .
          <source>Proceedings of the 10th WSEAS International Conference on Mathematics and Computers in Business and Economics (MCBE)</source>
          (ISBN: 978-960-474-063-5 / ISSN: 1790-5109), Mar.
          <year>2009</year>
          , Prague, Czech Republic. pp.
          <fpage>249</fpage>
          -
          <lpage>254</lpage>
          . ⟨hal-00467676⟩
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ayala-Rivera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>McDonagh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cerqueus</surname>
          </string-name>
          , and L. Murphy, “
          <article-title>Synthetic Data Generation using Benerator Tool</article-title>
          ,” arXiv (Cornell University), Oct.
          <year>2013</year>
          . URL: https://www.researchgate.net/publication/258125711_Synthetic_Data_Generation_using_Benerator_Tool
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T. W.</given-names>
            <surname>Göbel</surname>
          </string-name>
          , T. Schäfer, Julien Hachenberger,
          <string-name>
            <given-names>J.</given-names>
            <surname>Türr</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Baier</surname>
          </string-name>
          , “
          <article-title>A Novel Approach for Generating Synthetic Datasets for Digital Forensics</article-title>
          ,” pp.
          <fpage>73</fpage>
          -
          <lpage>93</lpage>
          , Jan.
          <year>2020</year>
          , doi: https://doi.org/10.1007/978-3-030-56223-6_5.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Ul Haq</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , Gondal, I., Vamplew, P., &amp; Layton, R.
          (
          <year>2016</year>
          ).
          <article-title>Generating Synthetic Datasets for Experimental Validation of Fraud Detection</article-title>
          .
          <source>Fourteenth Australasian Data Mining Conference, Canberra, Australia. Conferences in Research and Practice in Information Technology</source>
          , Vol.
          <volume>170</volume>
          . URL: https://www.researchgate.net/publication/316878436_Generating_Synthetic_Datasets_for_Experimental_Validation_of_Fraud_Detection.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Arvanitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Harrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chaplin</surname>
          </string-name>
          , and G. Despotou, “
          <article-title>A method for machine learning generation of realistic synthetic datasets for validating healthcare applications</article-title>
          ,”
          <source>Health Informatics Journal</source>
          , vol.
          <volume>28</volume>
          , no.
          <issue>2</issue>
          , p.
          <fpage>146045822210770</fpage>
          , Jan.
          <year>2022</year>
          , doi: https://doi.org/10.1177/14604582221077000.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Bullward</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aljebreen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coles</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McInerney</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          (
          <year>2023</year>
          ). Research Paper:
          <article-title>Process Mining and Synthetic Health Data: Reflections and Lessons Learnt</article-title>
          . In: Montali, M., Senderovich, A., Weidlich, M. (eds)
          <source>Process Mining Workshops. ICPM 2022</source>
          , vol
          <volume>468</volume>
          . Springer, Cham. https://doi.org/10.1007/978-3-031-27815-0_25.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Gräßler</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hieb</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roesmann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Unverzagt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Creating Synthetic Training Data for Machine Vision Quality Gates</article-title>
          . In: Lohweg, V. (eds) Bildverarbeitung in der Automation.
          <source>Technologien für die intelligente Automation</source>
          , vol
          <volume>17</volume>
          . Springer Vieweg, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-66769-9_7.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Barrera-Animas</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Davila Delgado</surname>
          </string-name>
          , “
          <article-title>Generating real-world-like labelled synthetic datasets for construction site applications</article-title>
          ,”
          <source>Automation in Construction</source>
          , vol.
          <volume>151</volume>
          , p.
          <fpage>104850</fpage>
          , Jul.
          <year>2023</year>
          , doi: https://doi.org/10.1016/j.autcon.2023.104850.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.</given-names>
            <surname>Manettas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nikolakis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Alexopoulos</surname>
          </string-name>
          , “
          <article-title>Synthetic datasets for Deep Learning in computer-vision assisted tasks in manufacturing</article-title>
          ,”
          <source>Procedia CIRP</source>
          , vol.
          <volume>103</volume>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>242</lpage>
          ,
          <year>2021</year>
          , doi: https://doi.org/10.1016/j.procir.2021.10.038.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Holst</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schoepflin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schüppstuhl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Generation of Synthetic AI Training Data for Robotic Grasp-Candidate Identification and Evaluation in Intralogistics Bin-Picking Scenarios</article-title>
          . In: Kim, KY.,
          <string-name>
            <surname>Monplaisir</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rickli</surname>
            ,
            <given-names>J</given-names>
          </string-name>
          . (eds)
          <article-title>Flexible Automation and Intelligent Manufacturing: The Human-Data-Technology Nexus</article-title>
          .
          <source>FAIM 2022. Lecture Notes in Mechanical Engineering</source>
          . Springer, Cham. https://doi.org/10.1007/978-3-031-18326-3_28.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Rather</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Generative adversarial network based synthetic data training model for lightweight convolutional neural networks</article-title>
          .
          <source>Multimed Tools Appl</source>
          (
          <year>2023</year>
          ). https://doi.org/10.1007/s11042-023-15747-6.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G. E. P.</given-names>
            <surname>Box</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Muller</surname>
          </string-name>
          , “
          <article-title>A Note on the Generation of Random Normal Deviates</article-title>
          ,”
          <source>The Annals of Mathematical Statistics</source>
          , vol.
          <volume>29</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>610</fpage>
          -
          <lpage>611</lpage>
          , Jun.
          <year>1958</year>
          , doi: https://doi.org/10.1214/aoms/1177706645.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Luk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H. W.</given-names>
            <surname>Leong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Villasenor</surname>
          </string-name>
          , “
          <article-title>Gaussian random number generators</article-title>
          ,”
          <source>ACM Computing Surveys</source>
          , vol.
          <volume>39</volume>
          , no.
          <issue>4</issue>
          , p.
          <fpage>11</fpage>
          , Nov.
          <year>2007</year>
          , doi: https://doi.org/10.1145/1287620.1287622.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Albashir</surname>
          </string-name>
          , Mohd,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ibrahim</surname>
          </string-name>
          , and Noratiqah Mohd Ariff, “
          <article-title>Extreme Value Distributions: An Overview of Estimation and Simulation</article-title>
          ,”
          <source>Journal of Probability and Statistics</source>
          , vol.
          <volume>2022</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          , Oct.
          <year>2022</year>
          , doi: https://doi.org/10.1155/2022/5449751.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>O.</given-names>
            <surname>Tymchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pylypenko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Iepik</surname>
          </string-name>
          , “
          <article-title>Forecasting of Categorical Time Series Using Computing with Words Model</article-title>
          ,”
          <source>in Selected Papers of the IX International Scientific Conference “Information Technology and Implementation” (IT&amp;I-2022), Workshop Proceedings</source>
          , Kyiv, Ukraine, November 30 - December 02,
          <year>2022</year>
          , vol.
          <volume>3384</volume>
          , pp.
          <fpage>151</fpage>
          -
          <lpage>159</lpage>
          . URL: https://ceur-ws.org/Vol-3384/Short_2.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kiktev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Osypenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shkurpela</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Balaniuk</surname>
          </string-name>
          ,
          <article-title>"Input Data Clustering for the Efficient Operation of Renewable Energy Sources in a Distributed Information System,"</article-title>
          <source>2020 IEEE 15th International Conference on Computer Sciences and Information Technologies (CSIT)</source>
          , Zbarazh, Ukraine,
          <year>2020</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>12</lpage>
          , doi: https://doi.org/10.1109/CSIT49958.2020.9321940.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>