                         Simulated Datasets Generator for Testing Data Analytics
                         Methods
                         Serhii Toliupa, Anna Pylypenko, Oleh Tymchuk and Oleksii Kohut
                         Taras Shevchenko National University of Kyiv, 64/13 Volodymyrska St, Kyiv, 01601, Ukraine

                                         Abstract
                                         This article explores the role of benchmark datasets in testing data analysis tools. The use of
                                         synthetic data generators is motivated by the need for scaling the size of training datasets,
                                         filling data gaps, and safeguarding data confidentiality. Existing research in this field
                                         emphasizes the importance of applications in various domains, such as fraud detection,
                                         healthcare, and computer graphics. The efficacy of constructing Gaussian distributions using
                                         the Box-Muller transform is investigated, while limitations in generating extreme values are
                                         highlighted. The integration of specialized distributions is proposed to address this gap and
                                         enhance dataset variability for improved performance in data analytics methods. The article
                                         presents a synthetic data generator capable of producing datasets for effective evaluation of
                                         machine learning methods. Practical tests demonstrate the software's effectiveness in creating
                                         test datasets with controlled variations. Four different tests were conducted, each with three
                                         different variants: 1) normal distribution parameters, namely standard deviation, 2) class
                                         imbalance, 3) missing values, and 4) outliers. The generated datasets were used to conduct
                                         controlled tests of logistic regression. Performance evaluation of the logistic regression model
                                         employed metrics such as Precision, Accuracy, Type 1 Error, Type 2 Error, and F1-Score.
                                         Keywords
                                         Benchmark datasets creation, synthetic dataset generator, analysis methods, extreme values.

                         1. Introduction
                             The modern world generates an enormous volume of data every day, ranging from sensor data to
                         textual information. Data analytics tools require the availability of realistic and representative datasets
                         for testing that would reflect contemporary challenges in data processing. The relevance of this topic is
                         also driven by the development of artificial intelligence and machine learning: the high demand for the
                         development of machine learning and artificial intelligence algorithms necessitates data for training and
                         evaluating these algorithms. Creating realistic datasets becomes critically important for precise
                         assessment and comparison of various methods. The complexity of this problem is associated with its
                         interdisciplinary nature. Analytical tools and methods are used across various domains, including
                         medicine, finance, biology, ecology, and others. The creation of benchmark datasets can also contribute
                         to addressing ethical concerns related to data privacy and security by developing anonymized or
                         synthetic data. So, the creation of benchmark datasets for testing data analytics tools is essential for
                         advancing science and technology in this field and ensuring their practical utility across diverse
                         domains.
    Many published research results provide comprehensive insights into the process of creating
benchmark datasets and its significance. In [1], the authors introduce the concept of benchmark
metrics in machine learning for scientific purposes and review existing approaches. They underscore
that selecting the most appropriate machine learning algorithm for scientific data analysis remains a
significant challenge due to the wide range of potentially applicable machine learning frameworks,
models, and computer architectures. The research by Krizhevsky, Sutskever, and Hinton [2] was the
first to demonstrate the profound influence of datasets on the

                         Dynamical System Modeling and Stability Investigation (DSMSI-2023), December 19-21, 2023, Kyiv, Ukraine
                         EMAIL: tolupa@i.ua (A. 1); anna.pylypenko@knu.ua (A. 2); oleh.tymchuk@knu.ua (A. 3); oleksii_kohut@knu.ua (A. 4)
                         ORCID: 0000-0002-1919-9174 (A. 1); 0000-0002-6343-4469 (A. 2); 0000-0002-9046-8015 (A. 3); 0009-0004-2203-3927 (A. 4)
             © 2023 Copyright for this paper by its authors.
                                      Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                      CEUR Workshop Proceedings (CEUR-WS.org)


learning outcomes of deep neural networks. This work can be instrumental in comprehending the
significance of benchmarking. The article [3] by M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim is a
significant contribution to the field of machine learning and data classification, providing practical
insights into the use of classifiers in real-world scenarios. The article explores various classification
algorithms and compares their performance across different benchmark datasets, which can serve as a
valuable resource for discussing the effectiveness of various analytical tools.
    Each data generator program uses a unique approach to data creation. The article [5] presents a data
generator designed to fill in gaps that may exist in other programs. The developed system allows users
to customize and create known statistical distributions to achieve the desired outcome. Additionally, it
offers real-time data behavior visualization to analyze whether the data possess the characteristics
necessary for effective testing. In the articles [6, 7], the authors provide an overview of the design and
architecture of the Information Discovery and Analysis Systems (IDAS) Data Set Generator (IDSG),
which enables a fast and comprehensive evaluation of IDAS. IDSG generates data using statistical
algorithms, rule-based algorithms, and semantic graphs that represent interdependencies between
attributes. To illustrate this approach, an application for credit card transactions is used. Srđan Popić et
al. [8] provided a brief overview of various types of generators in terms of their architecture and
anticipated usage, as well as listed their advantages and disadvantages. They also presented a review of
the data generation algorithms used and best practices in various domains.
    Researchers in this field have attempted to assess the utility of synthetic data generators using various
evaluation metrics. However, it has been found that these metrics lead to conflicting conclusions,
complicating the direct comparison of synthetic data generators. In their study, Fida Kamal Dankar and
colleagues [9, 10] identified four criteria for evaluating masked data by categorizing available utility
metrics into different categories based on the information they seek to preserve: attribute fidelity, two-
dimensional parameter fidelity, population fidelity, and application fidelity. In the article [11], the
authors have introduced several novel and efficient methods and multidimensional data structures that
can enhance the decision-making process in various domains. They have examined online range
aggregation, range selection, and weighted range median queries; for most of these, data structures and
techniques are presented that can provide answers in near-polynomial time.
    In practice, obtaining real data can be challenging due to confidentiality issues. Additionally, real
data may not conform to specific characteristics required for evaluating new approaches under certain
conditions. Given these constraints, the use of synthetic data becomes a viable alternative to supplement
real data in various domains. For example, in the article [12], the authors described the process of
generating synthetic data using the publicly available tool Benerator to mimic the distribution of
aggregated statistical data obtained from the national population census. The generated datasets
successfully replicated microdata containing records with social, economic, and demographic
information. Forensics also requires testing digital information tools. Thomas Göbel et al. [13]
introduced the concept of a structure called hystck for creating synthetic datasets based on the ground
truth. This framework supports automated generation of synthetic network traffic and artifacts of
operating systems and software by simulating human-computer interactions. To preserve
confidentiality, banks are unwilling to share fraud statistics and datasets with the public. To overcome
these limitations, Ikram Ul Haq et al. [14] introduced an innovative technique for generating uniformly
distributed synthetic data (HCRUD) based on highly correlated rules. This technique allows the
generation of synthetic datasets of any size, replicating the characteristics of restricted actual fraud data,
thus supporting further research in fraud detection. Access to medical datasets is also complicated due
to concerns about patient confidentiality. The development of synthetic datasets that are realistic enough
for testing digital applications is considered a potential alternative that allows their deployment.
Theodoros Arvanitis et al. [15] have devised a method for generating synthetic data statistically
equivalent to real clinical datasets and have demonstrated that the approach based on Generative
Adversarial Networks aligns with this goal. Thus, the concept of creating realistic medical synthetic
datasets has been successfully validated. However, data quality issues exist in both real and synthetic
data, with the latter reflecting both real-world problems and artifacts introduced by the synthesis process. The
intellectual analysis of synthetic healthcare data represents a novel field with its unique challenges.
According to Alistair Bullward et al. [16], researchers should be aware of the risks associated with
extrapolating results from synthetic data studies to real-world scenarios and should evaluate outcomes
using analysts who can review the underlying data. Synthetic data is frequently utilized in computer
graphics, where it serves to train computer vision models, as mentioned in [17, 18]. In many
industrial computer vision tasks, deep learning methods such as convolutional neural networks have
been successfully employed, as indicated by the works [19, 20]. In recent years, generative adversarial
networks (GANs) have been effectively utilized for generating new realistic images and manipulating
them, as noted in the research by I. H. Rather and S. Kumar [21].
   Therefore, the increasing availability and utilization of data analytics tools make the standardization
of benchmark datasets an essential task for their further adoption across various fields. The challenge
of developing more objective metrics and methods for assessing the utility of synthetic data remains
unresolved. One such problem is the adequate modeling of the tails of probability density functions.
Extreme values can significantly impact risks and outcomes in several sectors such as finance,
insurance, climatology, engineering, among others. This article presents research in the field of creating
synthetic data that aligns with real-world requirements and enables their effective use for testing various
analytical tools. The primary focus is on comprehending and analyzing characteristics of this generator,
especially in the tail areas of the Gaussian probability density function.

2. Mathematical methods
    The generation of Gaussian distributions is foundational in various scientific and computational
fields, playing a pivotal role in modeling natural phenomena and simulating random variables. To create
a Gaussian distribution, it is suggested to use the Box-Muller transform [22]. It produces a pair of
Gaussian random numbers using a pair of uniform numbers. The fundamental principle of the Box-
Muller algorithm lies in its ability to generate pairs of independent standard Gaussian random variables
from uniformly distributed random numbers. This method leverages the polar coordinate representation
in a two-dimensional space to transform pairs of independent uniformly distributed variables into sets
of normally distributed variables. By employing trigonometric functions and geometric interpretations,
this algorithm constructs Gaussian values by utilizing the magnitude and angle derived from uniformly
generated random variables. The algorithm to generate the Gaussian samples:
    1. Generate a sample using two distinct uniform random number generators, $u_1$ and $u_2$.
    2. Apply the inverse cumulative distribution function (CDF) of the exponential distribution to $u_1$
($\lambda = 1$):

$$ r = \sqrt{-2\ln(1 - u_1)} = \sqrt{-2\ln(u_1)}, \tag{1} $$

    where $r$ is the distance from the origin for each sample.
    For simplicity, $u_1$ is replaced by $1 - u_1$, since they are both uniform samples on $(0, 1)$.
    3. Apply the inverse CDF of the uniform distribution on $(0, 2\pi)$ to $u_2$:

$$ \theta = 2\pi u_2, \tag{2} $$

    where $\theta$ is the angle of the sample.
    4. Finally, determine $x$ and $y$ using basic trigonometric calculations:

$$ x = r\cos(\theta), \quad y = r\sin(\theta). \tag{3} $$
    The Box-Muller algorithm generates two independent random numbers upon each execution. Each
pair of generated numbers represents two independent random variables following a standard normal
distribution. These variables possess a mean of 0 and a standard deviation of 1. The produced values
can be utilized as required for various statistical or computational purposes. For instance, both numbers
from each pair can be employed to generate pairs of normally distributed random variables.
Alternatively, a single number from each pair might suffice if the need is for a singular normally
distributed value. By combining both sets of generated values into a single dataset, a larger and
potentially more diverse sample can be constructed, enriching the dataset used for testing machine
learning models. This integration allows for a broader spectrum of data points, enhancing the robustness
of the evaluation process and potentially fortifying the model's generalization capabilities.
    The Box-Muller algorithm, while efficient in generating core values from the standard normal
distribution, may necessitate additional methods to generate values from the tails of the distribution
[23]. Extreme or outlier values that reside in the tails of the distribution are often critical for assessing
rare events or evaluating the performance of algorithms. The generation of extreme values in random
numbers can be achieved using specialized distributions. Some distributions, such as the exponential,

Weibull, and Fréchet extreme value distributions, among others, directly model extreme values. The
proposed algorithm incorporates the following steps to generate such values.
   Continuation of the algorithm to generate the Gaussian samples:
    5. Set the parameter $c$, the shape parameter governing the tail behavior. It determines the shape
and heaviness of the distribution's tails.
    $c > 0$: indicates a distribution with heavier tails than the exponential distribution, meaning that
the probability of encountering extreme values decreases more slowly than in a normal distribution.
    $c = 0$: corresponds to the light-tailed (Gumbel) case, where the probability of extreme values
decreases exponentially.
    $c < 0$: indicates a distribution with bounded tails, so the probability of encountering extreme
values decreases more rapidly than in an exponential distribution.
    6. The probability density function of the distribution is given by the formula:

$$ f(t; c) = \begin{cases} e^{-(1+ct)^{-1/c}} \cdot (1+ct)^{-1/c-1}, & c \neq 0, \\ e^{-e^{-t}} \cdot e^{-t}, & c = 0; \end{cases} \tag{4} $$

    use the inverse transform method to generate extreme values:

$$ t = \begin{cases} \dfrac{1}{c}\left(\left(-\ln(1-u)\right)^{-c} - 1\right), & c \neq 0, \\ -\ln(-\ln(1-u)), & c = 0, \end{cases} \tag{5} $$

    where $u$ is a random variable from a uniform distribution on the interval $(0, 1)$ and can be obtained
as a fraction of the random variables generated in step 1. Validate the generated extreme values to ensure
they align with the expected tail behavior.
    7. After obtaining the standard normal random variables and the extreme values, $z := \mathrm{concatenate}(x, y, t)$,
one can move on to a normally distributed value with mathematical expectation $\mu$
and standard deviation $\sigma$:

$$ \xi = \mu + \sigma z. \tag{6} $$
    Depending on the practical task, there may be a need to shift the distribution along the axis of values
or to change its scale. In such cases, it is advantageous to employ the Generalized Extreme Value
(GEV) distributions, which combine the Gumbel, Fréchet, and Weibull families, also known as type I,
II, and III extreme value distributions [24]. These distributions offer flexibility in adjusting the
positioning and scaling of the distribution to accommodate various scenarios and analyses.

3. Development of synthetic dataset generator
    In the domain of data generation for machine learning research, a spectrum of tools and libraries,
including MOSTLY.AI (https://mostly.ai/), Mockaroo (https://www.mockaroo.com/), and Scikit-learn
(https://scikit-learn.org/stable/datasets/sample_generators.html), among others, is at researchers'
disposal. These tools furnish a fundamental framework for the generation of synthetic datasets, enabling
users to craft data with predetermined statistical distributions. Nevertheless, they exhibit inherent
limitations. For instance, scikit-learn is primarily constrained by its capacity to generate synthetic
datasets featuring a limited array of distributions and parameters, rendering it less suitable for the
creation of intricate or verisimilar data representations. These tools primarily target numerical data and
lack the specialized capabilities required for the synthesis of structured data types, such as textual
information, categorical attributes, or time series. In the subsequent sections of this article, the
utilization of synthetic data will be examined primarily in the context of testing machine learning
methods, considering the noted shortcomings of existing tools.

3.1.    Data model
   The specific capabilities of a synthetic data generator may vary from implementation to
implementation. However, such software should have the following main features:
   •    defining the name, description, and additional information associated with the model;


   •    describing all dimensions of the model, including their data type, name, description, and most
   importantly, the algorithm for generating the value;
   •    exporting synthetic data rows to a file or database based on the created model.
    Each data model must have at least one dimension. Each dimension should be defined by the
following characteristics: dimension name; data type (integer or real number, category, string);
expression/formula that defines how the data will be filled in; additional options (presence of blank
values/outliers). Values in columns can be either independent expressions or calculated based on values
in other columns (while avoiding cyclic dependencies, where, for instance, expression A relies on the
value from column B, and expression B relies on the value from column A). The data model serves as
an abstraction of a dataset, comprising specifications that characterize the behavior of the data [5].
Figure 1 depicts an example of a rudimentary data model illustrating the characterization of specific
individuals for the purpose of analyzing patient ages. There are three dimensions:
    1) Name: an informative dimension, intended more for identifying the row than for analyzing
the data. Such strings are not useful in machine learning, but if they were real data, there would be a
privacy issue. Given that this string is generated in a random manner, there is no need for concern
regarding the utilization of personal data belonging to individuals. Since this dimension is defined
without modifiers, the data in this row will always be present.
    2) Age: this dimension determines the patient's age. Its expression defines a random value from a
normal distribution with a mean of 14.0 and a variance of 3.0. There are no blank fields.
    3) IsAdult: this dimension determines whether the patient is an adult. This is the only dimension that
uses another dimension in its calculations. If the patient is over 18, this field is set to 1, otherwise 0.
There are no blank fields.




Figure 1: Example of a data model
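   To make the structure of Figure 1 concrete, the same three-dimension model can be written down declaratively. The Python rendering below is hypothetical; field names such as blank_rate illustrate the characteristics listed above rather than the generator's actual schema:

```python
# Hypothetical declarative rendering of the Figure 1 data model.
patients_model = {
    "name": "Patients",
    "description": "Characterization of individuals for analyzing patient ages",
    "dimensions": [
        # informative, randomly generated string; never blank
        {"name": "Name", "type": "string",
         "expression": "random_name()", "blank_rate": 0.0},
        # normal distribution with mean 14.0 and variance 3.0; no blank fields
        {"name": "Age", "type": "real",
         "expression": "Gauss(14.0, 3.0)", "blank_rate": 0.0},
        # the only dimension computed from another one (no cyclic dependencies)
        {"name": "IsAdult", "type": "integer",
         "expression": "(Age > 18) ? 1 : 0", "blank_rate": 0.0},
    ],
}
```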

3.2.    Modelling of the generator
   Editing a data model includes editing general information about the model, such as the name,
description, and availability to other users. Most importantly, in this mode, you can edit each dimension
separately, with a real-time check for the correctness of the entered data. Dimensions can be added or
deleted, but each model will always have at least one dimension with a line number. Each dimension must
be given a unique name that does not match other dimensions within the model. All dimensions must be
one of the defined data types (see Table 1).
Table 1
Supported data types
     Data type                       Description                                    Examples
 String             A sequence of symbols with variable length      "Oleksii", "Kohut", "test123"
 Integer            x ∈ ℤ                                           -2, -1, 0, 1, 2...
 Real               x ∈ ℝ                                           0.5, 1.3e22, 3.14, -0.001
 Category           Custom data type, choose one item from a        (1, 2, 3),
                    given list                                      ("apple", "banana", "orange")
 Boolean            Logical data type (example usage of the         (0, 1), ("true", "false")
                    category type)


   The lists of operands and functions available for generation are described in Tables 2-4.
Table 2
The list of operand generators
            Operand                           Description                     Usage example
               +                    Addition                                      x + y
               -                    Subtraction                                   x - y
               *                    Multiplication                                x * y
               /                    Division                                      x / y
               ^                    Exponentiation                                x ^ y
               ?                    Conditional operator                      (a = 0) ? x : y
Table 3
The list of random generators
           Distribution                        Arguments                      Usage example
 Uniform                                       a, b ∈ ℝ              Uniform(a, b)
 Normal (Gaussian)                             μ, σ ∈ ℝ              Gauss(mu, sigma)
 Cauchy                                        x0, γ ∈ ℝ             Cauchy(x0, gamma)
 Poisson                                       λ ∈ ℝ                 Poisson(lambda)
 Bernoulli                                     p ∈ ℝ                 Bernoulli(p)
 Categorical                                   a0, a1, ..., az ∈ C   Category("a", "b", "c")
 Weighted categorical                          a0, a1, ..., az ∈ C,  Weighted_category(("a", 1.0),
                                               p0, p1, ..., pz ∈ ℝ   ("b", 2.0), ("c", 3.0))
Table 4
The list of function generators
          Function                             Arguments                       Usage example
Absolute value (modulus)          x - value, x ∈ ℝ                     abs(x); |x|
Sign                              x - value, x ∈ ℝ                     sign(x)
Sine                              x - value, x ∈ ℝ                     sin(x)
Cosine                            x - value, x ∈ ℝ                     cos(x)
Tangent                           x - value, x ∈ ℝ                     tan(x)
Exponent                          x - value, x ∈ ℝ                     exp(x)
Logarithm                         x - value, p - base, x, p ∈ ℝ        log(x, p)
Natural logarithm                 x - value, x ∈ ℝ                     ln(x)
Square root                       x - value, x ∈ ℝ                     sqrt(x)
Cube root                         x - value, x ∈ ℝ                     cbrt(x)
Root of arbitrary base            x - value, p - base, x, p ∈ ℝ        nroot(x, p)
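   For readers working in Python, the Table 3 generators correspond roughly to the following NumPy routines; this mapping is purely illustrative and is not how lexData itself is implemented:

```python
import numpy as np

rng = np.random.default_rng()

uniform_v   = rng.uniform(0.0, 1.0)               # Uniform(a, b)
normal_v    = rng.normal(14.0, 3.0)               # Gauss(mu, sigma)
cauchy_v    = 0.0 + 1.0 * rng.standard_cauchy()   # Cauchy(x0, gamma): x0 + gamma * standard Cauchy
poisson_v   = rng.poisson(4.0)                    # Poisson(lambda)
bernoulli_v = rng.binomial(1, 0.5)                # Bernoulli(p)
category_v  = rng.choice(["a", "b", "c"])         # Category("a", "b", "c")
weighted_v  = rng.choice(["a", "b", "c"],         # Weighted_category(("a", 1.0),
                         p=[1/6, 2/6, 3/6])       #   ("b", 2.0), ("c", 3.0)), weights normalized
```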



3.3.    Usage scenarios: generating test datasets
   To verify the usefulness of such software, practical tests were performed with the help of the program.
A data model was specified in the generator and multiple tweaks were applied to it. The data set was
created using the proposed algorithm to generate the Gaussian samples (1)-(6). The default model is
described as follows:
   •    two classes;
   •    two dimensions: one dimension for the class, another for the value;
   •    the value is generated depending on the class with the help of the ternary operator and a normal distribution;
   •    no outliers, no blank values.


    The dataset was specified on the data model tab. For each test a CSV file with 1000 entries was
generated. Then, with the help of Python, graphs for this data were drawn. This provides a visual clue
about how the change affects the data. A total of four different tests were performed, each with three
different variations.
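    As a sketch of what producing such a test file might look like in Python, the snippet below reproduces the default model; the class means 20.0 and 30.0 are hypothetical stand-ins for the expressions held in the actual data model:

```python
import csv
import random

# Sketch of the default model: two balanced classes, value drawn from a
# class-dependent normal distribution (the means 20.0 / 30.0 are hypothetical).
with open("original.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["class", "value"])
    for _ in range(1000):                                  # 1000 entries per test file
        label = random.choice(["class1", "class2"])        # 50-50 class distribution
        mean = 20.0 if label == "class1" else 30.0         # ternary operator in the model
        writer.writerow([label, random.gauss(mean, 5.0)])  # std 5.0, as in the first test
```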

3.3.1. Changing parameters of normal distribution
   In this test, the standard deviation of the normal distribution for both classes was changed. The values
chosen were 5.0, 3.0, and 1.0. The expected result of such a change is that values become less dispersed
across the axis. The result of the software for this case is presented in Figures 2-3.




Figure 2: Data model options with standard deviation changes




Figure 3: Resulting data graphs with standard deviation changes (original.csv, std3.csv, std1.csv)
   For the current test, a swarm plot was used. The color represents the class of the item, and its position
on the Y axis represents the value. With each subsequent plot, the values become more tightly packed
around the central value, confirming that the standard deviation is indeed changing.

3.3.2. Changing class distribution
   Here, a weighted category was introduced in place of the default category. For class 1, the probability
of appearance decreases each time, and for class 2 it increases. The rate of change is 10%. So, the
first distribution of classes is 50-50%, then 40-60%, then 30-70% (Figures 4-5).
    As can be seen in Figure 5, the number of class2 entries grows with each subsequent picture,
confirming that this feature works.




Figure 4: Data model options with class distribution changes




Figure 5: Resulting data graphs with class distribution changes (original.csv, 40-60.csv, 30-70.csv)
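   A hedged sketch of the weighted-category behavior used in this test, with Python's random.choices standing in for the generator's Weighted_category expression:

```python
import random

# The three test variants: 50-50, 40-60, and 30-70 class distributions.
variants = {
    "original.csv": [0.5, 0.5],
    "40-60.csv":    [0.4, 0.6],
    "30-70.csv":    [0.3, 0.7],
}

def draw_class(weights):
    """Stand-in for Weighted_category(("class1", w1), ("class2", w2))."""
    return random.choices(["class1", "class2"], weights=weights, k=1)[0]

label = draw_class(variants["40-60.csv"])
```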


3.3.3. Adding outliers
   To use outliers, an additional dimension was added. A random value is calculated with the help of a
uniform distribution, and if it is less than a certain threshold (which equals the probability of such an
event), then the value is multiplied by 5 (Figures 6-7). Otherwise, the expected value is placed.




Figure 6: Data model options with added outliers
   It is clear that as the probability of outliers appearing increases, the number of outliers grows. It is
also worth noting that class2 outliers reach higher values than class1 outliers because of the bigger base
value. This creates additional separation between class1 and class2 in the higher values.
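   A minimal sketch of the outlier dimension described above, assuming the threshold equals the desired outlier probability (0.05, 0.10, or 0.20 in the three variants):

```python
import random

def maybe_outlier(value, probability):
    """Multiply the value by 5 with the given probability;
    otherwise return the expected value, as described above."""
    return value * 5 if random.random() < probability else value

# outliers10 variant; the class mean 30.0 is hypothetical
v = maybe_outlier(random.gauss(30.0, 5.0), 0.10)
```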




Figure 7: Resulting data graphs with outliers (outliers5.csv, outliers10.csv, outliers20.csv)

3.3.4. Setting up missing values
   It is possible to introduce missing values into the dataset in a manner similar to outliers, with the help
of an additional dimension. The only change is that the missing() function is used instead of
multiplication (Figures 8-9).




Figure 8: Data model options with blank values
    For this test, an event plot was used instead of a swarm plot. Colored lines represent the generated
objects (from 1 to 1000) with their corresponding class. Each black line represents a missing value.
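    The missing-value dimension can be sketched analogously, with None standing in for the generator's missing() marker:

```python
import random

def maybe_missing(value, probability):
    """Replace the value with a blank (None) with the given
    probability, mirroring the missing() function."""
    return None if random.random() < probability else value

# blank20 variant; the class mean 30.0 is hypothetical
v = maybe_missing(random.gauss(30.0, 5.0), 0.20)
```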

3.4. Software description
3.4.1. Module description
   Software implementation of this synthetic dataset generator (lexData) consists of three main
parts: a tokenizer, a parser, and a calculator. The input to this system is a text expression describing the
formula for the dimension value. Figure 10 depicts how the text input is processed into a result.
   In this simple example, there are multiple steps. In the first step, the input string is parsed into
multiple tokens. A token is an atomic part of the mathematical formula. A single plus sign is a token,
but a whole number or a variable name is also a token, since a number cannot simply be divided into its
digits. After that, the sequence of tokens is handled by the parser, which builds a complete function. After
passing concrete values for the parameters (which may be other dimensions) specified in the function, the
final value can be calculated and returned as the result. However, there are many more details, such as
verifying whether a referenced function exists or whether there are circular dependencies. The output of
the calculation is a single number or category, depending on the expression specified.
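   The actual pipeline is written in C# (see Section 3.4.2); purely to illustrate the three stages, the toy Python sketch below tokenizes and evaluates a single binary expression, omitting the function-existence and circular-dependency checks the real parser performs:

```python
import re

TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z_]\w*|[+\-*/^()?:,]")

def tokenize(expression):
    """Stage 1: split the input string into atomic tokens."""
    return TOKEN.findall(expression)

def evaluate(tokens, dimensions):
    """Stages 2-3, reduced to one binary operation for illustration."""
    left, op, right = tokens
    resolve = lambda t: dimensions[t] if t in dimensions else float(t)
    a, b = resolve(left), resolve(right)
    return {"+": a + b, "-": a - b, "*": a * b, "/": a / b, "^": a ** b}[op]

# "Age + 2", where Age is taken from another dimension of the same row:
print(evaluate(tokenize("Age + 2"), {"Age": 14.0}))  # -> 16.0
```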


Figure 9: Resulting data graphs with blank values (original.csv, blank10.csv, blank20.csv)




Figure 10: Module Interactions: Inputs and Output

3.4.2. Constraints and decisions
   The generator software was implemented in C# on the .NET Framework. The generator core is built
on top of a custom library for parsing and evaluating math expressions, called lexCalculator. To
integrate it with this software, the library was extended to support the following features:
   •    generating random distributions;
   •    string data types;
   •    logical expressions;
   •    complex data types and lists.


   With the help of NUnit, unit testing was performed on this software. Most of the possible test cases
were checked, and the tests provided more than 90% code coverage.
   The system was tested on the following system specifications:
   •     Windows 10 Pro;
   •     Intel(R) Core(TM) i5-8350U CPU;
   •     16GB RAM;
   •     NVMe SK hynix 256GB SSD.
   With these specifications, the generator's performance was measured on different datasets. Table 5
describes how many rows were generated per second on average for each specified dataset. Each
dimension is a simple switch between classes with a normal distribution calculation.
   As can be seen in Table 5, the number of classes does not affect the performance of generation much,
except for the parsing stage, where more classes mean there are more possibilities to handle. As for the
dimensions, these directly affect the performance, but that also depends on the expressions specified
for these dimensions.
Table 5
Performance testing of the software
                 Dataset description                                                Rows/s
               2 classes, 2 dimensions                                            16534 rows
               2 classes, 4 dimensions                                            10233 rows
               2 classes, 8 dimensions                                            7610 rows
               3 classes, 2 dimensions                                            15669 rows

4. Discussion
    Generators of synthetic datasets represent a potent tool for conducting controlled experiments and
investigating the performance of machine learning methods in various scenarios. They facilitate an
enhanced understanding of the capabilities and limitations of these methods. For instance, in this study,
the discussed synthetic datasets with known characteristics and distributions were utilized for the
purpose of conducting controlled tests of logistic regression, as illustrated in Table 6. These benchmark
datasets can encompass diverse data complexities, such as class imbalance, missing values, and outliers,
thereby aiding in assessing how effectively logistic regression operates under different conditions and
whether additional tuning is necessary. To assess the effectiveness of the logistic regression model, the
following metrics were employed:
    1. Precision is the ability of the classifier not to label as positive a sample that is negative. In Table 6,
this metric provides an assessment of the model's overall precision, considering the weights of each
class based on their distribution in the dataset. This allows for accounting for class imbalance, where
one class may have significantly more instances than others.
    2. Accuracy. This metric indicates the accuracy of a classification model, measuring the overall
percentage of correct predictions (both positive and negative) out of the total number of examples in
the dataset:

$$ \mathrm{accuracy}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} 1(\hat{y}_i = y_i), \tag{7} $$

    where $\hat{y}_i$ is the predicted value of the $i$-th sample, $y_i$ is the corresponding true value, and $1(x)$ is the
indicator function.
    3. Type 1 Error. This metric indicates the precision of the model for the class denoted as "class1",
or the positive class. It measures the percentage of correct positive predictions made by the model
among all positive predictions:

$$ \mathrm{Type\ 1\ Error} = \frac{tp}{tp + fp}, \tag{8} $$

    where $tp$ (true positive) is a correct result and $fp$ (false positive) is an unexpected result.


    4. Type 2 Error. This metric typically represents the proportion of false negative predictions made
by a classification model, specifically in the context of binary classification tasks. It quantifies the rate
at which the model incorrectly predicts negative outcomes, providing insight into its ability to avoid
missing positive cases:

$$ \mathrm{Type\ 2\ Error} = 1 - \frac{tp}{tp + fn}, \tag{9} $$

    where $tp$ (true positive) is a correct result and $fn$ (false negative) is a missed result.
    5. F1-Score. This metric is a combination of Precision and Recall and is used to assess the balance
between these two metrics. It is particularly useful in situations with class imbalance (different
numbers of instances for different classes) because it considers both Precision and Recall for each class
and calculates their weighted harmonic mean:

$$ \mathrm{F1\_Score} = 2 \cdot \frac{P(y, \hat{y}) \times R(y, \hat{y})}{P(y, \hat{y}) + R(y, \hat{y})}, \tag{10} $$

    where $P(y, \hat{y})$ is precision and $R(y, \hat{y})$ is recall.
    Leveraging synthetic dataset generators enables rapid iteration and testing of various hypotheses and
model parameters without the need to wait for real data.
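    As a hedged sketch of one such controlled test, the following Python code fits scikit-learn's LogisticRegression to a synthetic two-class sample (the class means are hypothetical) and computes the metrics (7)-(10):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Two classes, one value dimension; the means 20.0 / 30.0 are hypothetical.
X = np.concatenate([rng.normal(20.0, 5.0, 500),
                    rng.normal(30.0, 5.0, 500)]).reshape(-1, 1)
y = np.repeat([0, 1], 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
y_hat = LogisticRegression().fit(X_tr, y_tr).predict(X_te)

precision = precision_score(y_te, y_hat, average="weighted")  # metric 1
accuracy  = accuracy_score(y_te, y_hat)                       # metric 2, formula (7)
type1     = precision_score(y_te, y_hat)                      # formula (8): tp / (tp + fp)
type2     = 1.0 - recall_score(y_te, y_hat)                   # formula (9): 1 - tp / (tp + fn)
f1        = f1_score(y_te, y_hat)                             # formula (10)
```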

Table 6
Classification comparison
      Dataset           Precision      Accuracy        Type 1 Error       Type 2 Error        F1-Score
 original.csv          0.860633       0.860000        0.844660           0.876289            0.859972
 std1.csv              1.000000       1.000000        1.000000           1.000000            1.000000
 std3.csv              0.955073       0.955000        0.961905           0.947368            0.955012
 40-60.csv             0.865830       0.865000        0.876712           0.858268            0.863560
 30-70.csv             0.864803       0.865000        0.862745           0.865772            0.860449
 blank10.csv           0.856354       0.856354        0.858696           0.853933            0.856354
 blank20.csv           0.885321       0.885350        0.884058           0.886364            0.885190
 outliers5.csv         0.865167       0.865000        0.870968           0.859813            0.864888
 outliers10.csv        0.860000       0.860000        0.871560           0.846154            0.860000
 outliers20.csv        0.868215       0.865000        0.831776           0.903226            0.864848

5. Conclusion
    The article explores the construction of Gaussian distributions using the Box-Muller transform, a
method relying on uniform random numbers to generate pairs of independent Gaussian variables.
However, while efficient for core Gaussian values, the Box-Muller algorithm may fall short in
generating extreme or outlier values crucial for evaluating rare events. To address this limitation, the
article proposes incorporating specialized distributions to generate extreme values in tails. By
combining these extreme values with standard normal random variables, a broader dataset can be
formed, enriching evaluations and bolstering machine learning models.
    This study also introduces a synthetic data generator designed for evaluating data visualization
methods and machine learning systems. The application is highly adaptable, providing users with the
capability to create and store models, generate artificial data, and even explore models created by other
users. This enhanced accessibility promotes collaborative learning and testing of machine learning
models on data. This article also demonstrates the practical utility of the application in the realm of
assessing machine learning algorithms. It illustrates the process of generating various datasets, enabling
precise control over typical challenges encountered in machine learning tasks. The visual
representations created for each scenario provide compelling evidence of the tool's reliability in
validating diverse situations.
    In future research endeavors, the exploration of employing machine learning techniques to enhance
the realism of synthetic data by introducing common noise patterns observed in real-world data, while


preserving the fundamental distribution, can be considered. Additionally, another avenue of
investigation involves the generation of synthetic data encompassing categorical, time-series, or mixed
data types. This would enable the utilization of the generated synthetic data within the context of the
Computing with Words Model [25, 26] and other fuzzy set models.
   In conclusion, the creation of benchmark datasets for testing data analytics tools represents a crucial
step in the advancement of data science and machine learning research, offering a standardized means
of evaluating and comparing the performance of various analytical methodologies. These benchmark
datasets not only facilitate fair and rigorous assessment of data analytics tools but also open avenues
for future research in the refinement of synthetic data generation techniques and the development of
more comprehensive and realistic benchmark datasets.

6. References
[1] J. Thiyagalingam, M. Shankar, G. Fox, and T. Hey, "Scientific machine learning benchmarks,"
     Nature Reviews Physics, vol. 4, no. 6, pp. 413–420, Apr. 2022, doi:
     https://doi.org/10.1038/s42254-022-00441-7.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional
     neural networks," Communications of the ACM, vol. 60, no. 6, pp. 84–90, May 2012, doi:
     https://doi.org/10.1145/3065386.
[3] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, "Do we need hundreds of
     classifiers to solve real world classification problems?," Journal of Machine Learning Research,
     Jan. 2014, doi: https://doi.org/10.5555/2627435.2697065.
[4] A. Qayyum, J. Qadir, M. Bilal, and A. Al-Fuqaha, "Secure and Robust Machine Learning for
     Healthcare: A Survey," IEEE Reviews in Biomedical Engineering, vol. 14, pp. 156–180, 2021,
     doi: https://doi.org/10.1109/rbme.2020.3013489.
[5] P. Mendonca, S. Brito, C. Gustavo, R. Santos, and T. Araujo, "Synthetic Datasets Generator for
     Testing Information Visualization and Machine Learning Techniques and Tools," IEEE Access,
     vol. 8, pp. 82917–82928, Jan. 2020, doi: https://doi.org/10.1109/access.2020.2991949.
[6] D. R. Jeske et al., "Generation of synthetic data sets for evaluating the accuracy of knowledge
     discovery systems," Knowledge Discovery and Data Mining, Aug. 2005, doi:
     https://doi.org/10.1145/1081870.1081969.
[7] P. J. Lin et al., "Development of a Synthetic Data Set Generator for Building and Testing
     Information Discovery Systems," IEEE Xplore, Apr. 01, 2006. URL:
     https://ieeexplore.ieee.org/abstract/document/1611688.
[8] S. Popic, B. Pavkovic, I. Velikic, and N. Teslic, "Data generators: a short survey of techniques
     and use cases with focus on testing," 2019 IEEE 9th International Conference on Consumer
     Electronics (ICCE-Berlin), 2019, doi: https://doi.org/10.1109/ICCE-BERLIN47944.2019.8966202.
[9] F. K. Dankar and M. Ibrahim, "Fake It Till You Make It: Guidelines for Effective Synthetic Data
     Generation," Applied Sciences, vol. 11, no. 5, p. 2158, Feb. 2021, doi:
     https://doi.org/10.3390/app11052158.
[10] F. K. Dankar, M. K. Ibrahim, and L. Ismail, "A Multi-Dimensional Evaluation of Synthetic Data
     Generators," IEEE Access, vol. 10, pp. 11147–11158, 2022, doi:
     https://doi.org/10.1109/access.2022.3144765.
[11] M. Andreica, M. I. Andreica, and N. Cataniciu, "Multidimensional Data Structures and
     Techniques for Efficient Decision Making," Proceedings of the 10th WSEAS International
     Conference on Mathematics and Computers in Business and Economics (MCBE)
     (ISBN: 978-960-474-063-5 / ISSN: 1790-5109), Prague, Czech Republic, Mar. 2009, pp. 249–254.
     ⟨hal-00467676⟩
[12] V. Ayala-Rivera, P. McDonagh, T. Cerqueus, and L. Murphy, "Synthetic Data Generation using
     Benerator Tool," arXiv (Cornell University), Oct. 2013. URL:
     https://www.researchgate.net/publication/258125711_Synthetic_Data_Generation_using_Benerator_Tool.
[13] T. Göbel, T. Schäfer, J. Hachenberger, J. Türr, and H. Baier, "A Novel Approach for Generating
     Synthetic Datasets for Digital Forensics," pp. 73–93, Jan. 2020, doi:
     https://doi.org/10.1007/978-3-030-56223-6_5.
[14] I. Ul Haq, I. Gondal, P. Vamplew, and R. Layton, "Generating Synthetic Datasets for
     Experimental Validation of Fraud Detection," Fourteenth Australasian Data Mining Conference,
     Canberra, Australia, Conferences in Research and Practice in Information Technology, vol. 170,
     2016. URL:
     https://www.researchgate.net/publication/316878436_Generating_Synthetic_Datasets_for_Experimental_Validation_of_Fraud_Detection.
[15] T. N. Arvanitis, S. White, S. Harrison, R. Chaplin, and G. Despotou, "A method for machine
     learning generation of realistic synthetic datasets for validating healthcare applications," Health
     Informatics Journal, vol. 28, no. 2, Jan. 2022, doi: https://doi.org/10.1177/14604582221077000.
[16] A. Bullward, A. Aljebreen, A. Coles, C. McInerney, and O. Johnson, "Research Paper: Process
     Mining and Synthetic Health Data: Reflections and Lessons Learnt," in M. Montali,
     A. Senderovich, and M. Weidlich (eds), Process Mining Workshops, ICPM 2022, vol. 468,
     Springer, Cham, 2023, doi: https://doi.org/10.1007/978-3-031-27815-0_25.
[17] I. Gräßler, M. Hieb, D. Roesmann, and M. Unverzagt, "Creating Synthetic Training Data for
     Machine Vision Quality Gates," in V. Lohweg (ed), Bildverarbeitung in der Automation,
     Technologien für die intelligente Automation, vol. 17, Springer Vieweg, Berlin, Heidelberg,
     2023, doi: https://doi.org/10.1007/978-3-662-66769-9_7.
[18] A. Y. Barrera-Animas and J. M. Davila Delgado, "Generating real-world-like labelled synthetic
     datasets for construction site applications," Automation in Construction, vol. 151, p. 104850,
     Jul. 2023, doi: https://doi.org/10.1016/j.autcon.2023.104850.
[19] C. Manettas, N. Nikolakis, and K. Alexopoulos, "Synthetic datasets for Deep Learning in
     computer-vision assisted tasks in manufacturing," Procedia CIRP, vol. 103, pp. 237–242, 2021,
     doi: https://doi.org/10.1016/j.procir.2021.10.038.
[20] D. Holst, D. Schoepflin, and T. Schüppstuhl, "Generation of Synthetic AI Training Data for
     Robotic Grasp-Candidate Identification and Evaluation in Intralogistics Bin-Picking Scenarios,"
     in K.-Y. Kim, L. Monplaisir, and J. Rickli (eds), Flexible Automation and Intelligent
     Manufacturing: The Human-Data-Technology Nexus, FAIM 2022, Lecture Notes in Mechanical
     Engineering, Springer, Cham, 2023, doi: https://doi.org/10.1007/978-3-031-18326-3_28.
[21] I. H. Rather and S. Kumar, "Generative adversarial network based synthetic data training model
     for lightweight convolutional neural networks," Multimedia Tools and Applications, 2023, doi:
     https://doi.org/10.1007/s11042-023-15747-6.
[22] G. E. P. Box and M. E. Muller, "A Note on the Generation of Random Normal Deviates," The
     Annals of Mathematical Statistics, vol. 29, no. 2, pp. 610–611, Jun. 1958, doi:
     https://doi.org/10.1214/aoms/1177706645.
[23] D. B. Thomas, W. Luk, P. H. W. Leong, and J. D. Villasenor, "Gaussian random number
     generators," ACM Computing Surveys, vol. 39, no. 4, p. 11, Nov. 2007, doi:
     https://doi.org/10.1145/1287620.1287622.
[24] A. Albashir, Mohd, K. Ibrahim, and N. Mohd Ariff, "Extreme Value Distributions: An Overview
     of Estimation and Simulation," Journal of Probability and Statistics, vol. 2022, pp. 1–17,
     Oct. 2022, doi: https://doi.org/10.1155/2022/5449751.
[25] O. Tymchuk, A. Pylypenko, and M. Iepik, "Forecasting of Categorical Time Series Using
     Computing with Words Model," in Selected Papers of the IX International Scientific Conference
     "Information Technology and Implementation" (IT&I-2022), Workshop Proceedings, Kyiv,
     Ukraine, November 30 - December 02, 2022, vol. 3384, pp. 151–159. URL:
     https://ceur-ws.org/Vol-3384/Short_2.pdf.
[26] N. Kiktev, V. Osypenko, N. Shkurpela, and A. Balaniuk, "Input Data Clustering for the Efficient
     Operation of Renewable Energy Sources in a Distributed Information System," 2020 IEEE 15th
     International Conference on Computer Sciences and Information Technologies (CSIT), Zbarazh,
     Ukraine, 2020, pp. 9–12, doi: https://doi.org/10.1109/CSIT49958.2020.9321940.