<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Simulated Datasets Generator for Testing Data Analytics Methods</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Serhii</forename><surname>Toliupa</surname></persName>
							<email>tolupa@i.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Taras Shevchenko National University of Kyiv</orgName>
								<address>
									<addrLine>64/13 Volodymyrska St</addrLine>
									<postCode>01601</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Anna</forename><surname>Pylypenko</surname></persName>
							<email>anna.pylypenko@knu.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Taras Shevchenko National University of Kyiv</orgName>
								<address>
									<addrLine>64/13 Volodymyrska St</addrLine>
									<postCode>01601</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Oleh</forename><surname>Tymchuk</surname></persName>
							<email>oleh.tymchuk@knu.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Taras Shevchenko National University of Kyiv</orgName>
								<address>
									<addrLine>64/13 Volodymyrska St</addrLine>
									<postCode>01601</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Oleksii</forename><surname>Kohut</surname></persName>
							<email>oleksii_kohut@knu.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Taras Shevchenko National University of Kyiv</orgName>
								<address>
									<addrLine>64/13 Volodymyrska St</addrLine>
									<postCode>01601</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Simulated Datasets Generator for Testing Data Analytics Methods</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">FA46A89A08AF2FA5DA2273F2808BE7FD</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:51+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Benchmark datasets creation</term>
					<term>synthetic dataset generator</term>
					<term>analysis methods</term>
					<term>extreme values</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This article explores the role of benchmark datasets in testing data analysis tools. The use of synthetic data generators is motivated by the need for scaling the size of training datasets, filling data gaps, and safeguarding data confidentiality. Existing research in this field emphasizes the importance of applications in various domains, such as fraud detection, healthcare, and computer graphics. The efficacy of constructing Gaussian distributions using the Box-Muller transform is investigated, while limitations in generating extreme values are highlighted. The integration of specialized distributions is proposed to address this gap and enhance dataset variability for improved performance in data analytics methods. The article presents a synthetic data generator capable of producing datasets for effective evaluation of machine learning methods. Practical tests demonstrate the software's effectiveness in creating test datasets with controlled variations. Four different tests were conducted, each with three different variants: 1) normal distribution parameters, namely standard deviation, 2) class imbalance, 3) missing values, and 4) outliers. The generated datasets were used to conduct controlled tests of logistic regression. Performance evaluation of the logistic regression model employed metrics such as Precision, Accuracy, Type 1 Error, Type 2 Error, and F1-Score.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The modern world generates an enormous volume of data every day, ranging from sensor data to textual information. Data analytics tools require the availability of realistic and representative datasets for testing that would reflect contemporary challenges in data processing. The relevance of this topic is also driven by the development of artificial intelligence and machine learning: the high demand for the development of machine learning and artificial intelligence algorithms necessitates data for training and evaluating these algorithms. Creating realistic datasets becomes critically important for precise assessment and comparison of various methods. The complexity of this problem is associated with its interdisciplinary nature. Analytical tools and methods are used across various domains, including medicine, finance, biology, ecology, and others. The creation of benchmark datasets can also contribute to addressing ethical concerns related to data privacy and security by developing anonymized or synthetic data. So, the creation of benchmark datasets for testing data analytics tools is essential for advancing science and technology in this field and ensuring their practical utility across diverse domains.</p><p>There is a substantial body of published research that provides comprehensive insights into the process of creating benchmark datasets and their significance. In <ref type="bibr" target="#b0">[1]</ref>, the authors introduce the concept of benchmark metrics in machine learning for scientific purposes and review existing approaches. They underscore that the selection of the most appropriate machine learning algorithm for scientific data analysis remains a significant challenge due to the wide range of potentially applicable machine learning frameworks, models, and computer architectures.
The research by Krizhevsky, Sutskever, and Hinton <ref type="bibr" target="#b1">[2]</ref> was the first to demonstrate the profound influence of datasets on the learning outcomes of deep neural networks, and it is instrumental in comprehending the significance of benchmarking. The article <ref type="bibr" target="#b2">[3]</ref> by M. Fernandez-Delgado, E. Cernadas, S. Barro and D. Amorim is a significant contribution to the field of machine learning and data classification, providing practical insights into the use of classifiers in real-world scenarios. The article explores various classification algorithms and compares their performance across different benchmark datasets, which can serve as a valuable resource for discussing the effectiveness of various analytical tools.</p><p>Each data generator program uses a unique approach to data creation. The article <ref type="bibr" target="#b4">[5]</ref> presents a data generator designed to fill in gaps that may exist in other programs. The developed system allows users to customize and create known statistical distributions to achieve the desired outcome. Additionally, it offers real-time data behavior visualization to analyze whether the data possess the characteristics necessary for effective testing. In the articles <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>, the authors provide an overview of the design and architecture of the Information Discovery and Analysis Systems (IDAS) Data Set Generator (IDSG), which enables a fast and comprehensive evaluation of IDAS. IDSG generates data using statistical algorithms, rule-based algorithms, and semantic graphs that represent interdependencies between attributes. To illustrate this approach, an application for credit card transactions is used. Srđan Popić et al.
<ref type="bibr" target="#b7">[8]</ref> provided a brief overview of various types of generators in terms of their architecture and anticipated usage, as well as listed their advantages and disadvantages. They also presented a review of the data generation algorithms used and best practices in various domains.</p><p>Researchers in this field have attempted to assess the utility of synthetic data generators using various evaluation metrics. However, it has been found that these metrics lead to conflicting conclusions, complicating the direct comparison of synthetic data generators. In their study, Fida Kamal Dankar and colleagues <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref> identified four criteria for evaluating masked data by categorizing available utility metrics into different categories based on the information they seek to preserve: attribute fidelity, two-dimensional parameter fidelity, population fidelity, and application fidelity. In the article <ref type="bibr" target="#b10">[11]</ref>, the authors have introduced several novel and efficient methods and multidimensional data structures that can enhance the decision-making process in various domains. They have examined online range aggregation, range selection, and weighted range median queries; for most of these, data structures and techniques are presented that can provide answers in near-polynomial time.</p><p>In practice, obtaining real data can be challenging due to confidentiality issues. Additionally, real data may not conform to specific characteristics required for evaluating new approaches under certain conditions. Given these constraints, the use of synthetic data becomes a viable alternative to supplement real data in various domains.
For example, in the article <ref type="bibr" target="#b11">[12]</ref>, the authors described the process of generating synthetic data using the publicly available tool Benerator to mimic the distribution of aggregated statistical data obtained from the national population census. The generated datasets successfully replicated microdata containing records with social, economic, and demographic information. Forensics also requires testing digital information tools. Thomas Göbel et al. <ref type="bibr" target="#b12">[13]</ref> introduced a framework called hystck for creating synthetic datasets based on the ground truth. This framework supports automated generation of synthetic network traffic and artifacts of operating systems and software by simulating human-computer interactions. To preserve confidentiality, banks are unwilling to share fraud statistics and datasets with the public. To overcome these limitations, Ikram Ul Haq et al. <ref type="bibr" target="#b13">[14]</ref> introduced an innovative technique for generating uniformly distributed synthetic data (HCRUD) based on highly correlated rules. This technique allows the generation of synthetic datasets of any size, replicating the characteristics of restricted actual fraud data, thus supporting further research in fraud detection. Access to medical datasets is also complicated due to concerns about patient confidentiality. The development of synthetic datasets that are realistic enough for testing digital applications is considered a potential alternative that enables their deployment. Theodoros Arvanitis et al. <ref type="bibr" target="#b14">[15]</ref> have devised a method for generating synthetic data statistically equivalent to real clinical datasets and have demonstrated that the approach based on Generative Adversarial Networks aligns with this goal. Thus, the concept of creating realistic medical synthetic datasets has been successfully validated.
However, data quality issues exist both in real and synthetic data, with the latter reflecting both real-world problems and artifacts created by synthetic generation. The intellectual analysis of synthetic healthcare data represents a novel field with its unique challenges. According to Alistair Bullward et al. <ref type="bibr" target="#b15">[16]</ref>, researchers should be aware of the risks associated with extrapolating results from synthetic data studies to real-world scenarios and should evaluate outcomes using analysts who can review the underlying data. Synthetic data is also frequently utilized in computer graphics, for instance to render imagery for training computer vision models, as mentioned in <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref>. In many industrial computer vision tasks, deep learning methods such as convolutional neural networks have been successfully employed, as indicated by the works <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b19">20]</ref>. In recent years, generative adversarial networks (GANs) have been effectively utilized for generating new realistic images and manipulating them, as noted in the research by I. H. Rather and S. Kumar <ref type="bibr" target="#b20">[21]</ref>.</p><p>Therefore, the increasing availability and utilization of data analytics tools make the standardization of benchmark datasets an essential task for their further adoption across various fields. The challenge of developing more objective metrics and methods for assessing the utility of synthetic data remains unresolved. One of these problems is adequate tail-end modeling of probability density functions. Extreme values can significantly impact risks and outcomes in sectors such as finance, insurance, climatology, and engineering. This article presents research in the field of creating synthetic data that aligns with real-world requirements and enables their effective use for testing various analytical tools.
The primary focus is on comprehending and analyzing characteristics of this generator, especially in the tail areas of the Gaussian probability density function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Mathematical methods</head><p>The generation of Gaussian distributions is foundational in various scientific and computational fields, playing a pivotal role in modeling natural phenomena and simulating random variables. To create a Gaussian distribution, it is suggested to use the Box-Muller transform <ref type="bibr" target="#b21">[22]</ref>. It produces a pair of Gaussian random numbers from a pair of uniform random numbers. The fundamental principle of the Box-Muller algorithm lies in its ability to generate pairs of independent standard Gaussian random variables from uniformly distributed random numbers. This method leverages the polar coordinate representation in a two-dimensional space to transform pairs of independent uniformly distributed variables into pairs of normally distributed variables. By employing trigonometric functions and geometric interpretations, this algorithm constructs Gaussian values from the magnitude and angle derived from uniformly generated random variables. The algorithm to generate the Gaussian samples:</p><p>1. Generate two samples, 𝑢₁ and 𝑢₂, using two distinct uniform random number generators.</p><p>2. Apply the inverse cumulative distribution function (CDF) of the exponential distribution to 𝑢₁ (𝜆 = 1):</p><formula xml:id="formula_0">𝑟 = √(−2 ln(1 − 𝑢₁)) = √(−2 ln 𝑢₁), (<label>1</label></formula><formula xml:id="formula_1">)</formula><p>where 𝑟 is the distance from the origin for each sample. For simplicity, 1 − 𝑢₁ is replaced by 𝑢₁, since they are both uniform samples on (0, 1).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Apply the inverse CDF of the uniform distribution on</head><formula xml:id="formula_2">(0, 2𝜋) to 𝑢₂: 𝜃 = 2𝜋𝑢₂,<label>(2)</label></formula><p>where 𝜃 is the angle of the sample. 4. Finally, determine 𝑥 and 𝑦 using basic trigonometric calculations: 𝑥 = 𝑟 cos(𝜃), 𝑦 = 𝑟 sin(𝜃). (3)</p><p>The Box-Muller algorithm generates two independent random numbers upon each execution. Each pair of generated numbers represents two independent random variables following a standard normal distribution. These variables possess a mean of 0 and a standard deviation of 1. The produced values can be utilized as required for various statistical or computational purposes. For instance, both numbers from each pair can be employed to generate pairs of normally distributed random variables. Alternatively, a single number from each pair might suffice if the need is for a singular normally distributed value. By combining both sets of generated values into a single dataset, a larger and potentially more diverse sample can be constructed, enriching the dataset used for testing machine learning models. This integration allows for a broader spectrum of data points, enhancing the robustness of the evaluation process and potentially fortifying the model's generalization capabilities.</p><p>The Box-Muller algorithm, while efficient in generating core values from the standard normal distribution, may necessitate additional methods to generate values from the tails of the distribution <ref type="bibr" target="#b22">[23]</ref>. Extreme or outlier values that reside in the tails of the distribution are often critical for assessing rare events or evaluating the performance of algorithms. The generation of extreme values in random numbers can be achieved using specialized distributions. Some distributions, such as the exponential, Weibull, and Fréchet extreme value distributions, directly model extreme values.
The proposed algorithm incorporates the following steps to generate such values.</p><p>Continuation of the algorithm to generate the Gaussian samples: 5. Set the parameter 𝑐, the shape parameter governing the tail behavior. It determines the shape and heaviness of the distribution's tails.</p><p>𝑐 &gt; 0: indicates a distribution with bounded tails. It implies that the distribution's tails are bounded, and the probability of encountering extreme values decreases more rapidly than in a normal distribution. 𝑐 = 0: corresponds to the exponential distribution, where the tails are light, and the probability of extreme values decreases exponentially.</p><p>𝑐 &lt; 0: indicates a distribution with heavier tails than the exponential distribution. This suggests that the probability of encountering extreme values decreases slower than in an exponential distribution.</p><p>6. The probability density function of the distribution is given by the formula:</p><formula xml:id="formula_3">𝑓(𝑡; 𝑐) = { exp(−(1 + 𝑐𝑡)^(−1/𝑐)) ⋅ (1 + 𝑐𝑡)^(−1/𝑐 − 1) for 𝑐 ≠ 0; exp(−𝑒^(−𝑡)) ⋅ 𝑒^(−𝑡) for 𝑐 = 0,<label>(4)</label></formula><p>Use the inverse transform method to generate extreme values:</p><formula xml:id="formula_4">𝑡 = { (1/𝑐) ⋅ ((−ln(1 − 𝑢))^(−𝑐) − 1) for 𝑐 ≠ 0; −ln(−ln(1 − 𝑢)) for 𝑐 = 0, (<label>5</label></formula><formula xml:id="formula_5">)</formula><p>where 𝑢 is a random variable from a uniform distribution on the interval (0, 1) and can be obtained from the uniform random variables generated in step 1. Validate the generated extreme values to ensure they align with the expected tail behavior. </p><p>Depending on the practical task, there might be a need to shift the distribution along the axis of values or change its scale. In such cases, it would be advantageous to employ the Generalized Extreme Value (GEV) distributions, which combine the Gumbel, Fréchet, and Weibull families, also known as type I, II, and III extreme value distributions <ref type="bibr" target="#b23">[24]</ref>.
These distributions offer flexibility in adjusting the positioning and scaling of the distribution to accommodate various scenarios and analyses.</p></div>
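As an illustration, steps 1-6 above can be sketched in Python. This is a minimal sketch, not the software described in this article; the function names and the sampling loop are ours, while the formulas follow (1)-(3) and (5) directly.

```python
import math
import random

def box_muller(u1, u2):
    """Steps 1-4: transform two uniform samples on (0, 1) into a pair of
    independent standard normal samples."""
    r = math.sqrt(-2.0 * math.log(1.0 - u1))   # step 2: inverse exponential CDF, eq. (1)
    theta = 2.0 * math.pi * u2                 # step 3: uniform angle on (0, 2*pi), eq. (2)
    return r * math.cos(theta), r * math.sin(theta)  # step 4, eq. (3)

def tail_sample(u, c):
    """Steps 5-6: inverse-transform sampling of the tail distribution, eq. (5).
    c is the shape parameter; c = 0 reduces to the Gumbel case."""
    if c != 0.0:
        return ((-math.log(1.0 - u)) ** (-c) - 1.0) / c
    return -math.log(-math.log(1.0 - u))

def gaussian_samples(n, mean=0.0, std=1.0, seed=None):
    """Generate n samples from N(mean, std^2) via the Box-Muller transform."""
    rng = random.Random(seed)
    out = []
    while len(out) < n:
        x, y = box_muller(rng.random(), rng.random())
        out.extend([mean + std * x, mean + std * y])
    return out[:n]
```

Note that both numbers of each Box-Muller pair are kept here, matching the observation above that combining both sets of generated values enriches the dataset.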
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Development of synthetic dataset generator</head><p>In the domain of data generation for machine learning research, a spectrum of tools and libraries, including MOSTLY.AI (https://mostly.ai/), Mockaroo (https://www.mockaroo.com/), and Scikit-learn (https://scikit-learn.org/stable/datasets/sample_generators.html), among others, is at researchers' disposal. These tools furnish a fundamental framework for the generation of synthetic datasets, enabling users to craft data with predetermined statistical distributions. Nevertheless, they exhibit inherent limitations. For instance, scikit-learn is primarily constrained in its capacity to generate synthetic datasets, featuring a limited array of distributions and parameters, rendering it less suitable for the creation of intricate or realistic data representations. These tools primarily target numerical data and lack the specialized capabilities required for the synthesis of structured data types, such as textual information, categorical attributes, or time series. In the subsequent sections of this article, the utilization of synthetic data will be examined primarily in the context of testing machine learning methods, considering the noted shortcomings of existing tools.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Data model</head><p>The specific capabilities of a synthetic data generator may vary from implementation to implementation. However, such software should have the following main features: defining the name, description, and additional information associated with the model; describing all dimensions of the model, including their data type, name, description, and, most importantly, the algorithm for generating the value; and exporting synthetic data rows to a file or database based on the created model. Each data model must have at least one dimension. Each dimension should be defined by the following characteristics: dimension name; data type (integer or real number, category, string); expression/formula that defines how the data will be filled in; additional options (presence of blank values/outliers). Values in columns can be either independent expressions or calculated based on values in other columns (while avoiding cyclic dependencies, where, for instance, expression A relies on the value from column B, and expression B relies on the value from column A). The data model serves as an abstraction of a dataset, comprising specifications that characterize the behavior of the data <ref type="bibr" target="#b4">[5]</ref>. Figure <ref type="figure" target="#fig_1">1</ref> depicts an example of a rudimentary data model illustrating the characterization of specific individuals for the purpose of analyzing patient ages. There are three dimensions:</p><p>1) Name: an informative dimension, intended more for identifying the row than for analyzing the data. Such strings are not useful in machine learning, but if it were real data, there would be a privacy issue. Given that this string is generated in a random manner, there is no need for concern regarding the utilization of personal data belonging to individuals.
Since this dimension is defined without modifiers, the data in this row will always be present.</p><p>2) Age: this dimension determines the patient's age. Its expression defines a random value from a normal distribution with a mean of 14.0 and a variance of 3.0. There are no blank fields.</p><p>3) IsAdult: this dimension determines whether the patient is an adult. This is the only dimension that uses another dimension in its calculations. If the patient is over 18, this field is set to 1, otherwise 0. There are no blank fields. </p></div>
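The three-dimension model of Figure 1 can be mimicked in a few lines of Python. This is a hypothetical sketch, not the generator's own expression language; in particular, the 3.0 spread parameter stated for Age is treated here as the standard deviation of `random.gauss`.

```python
import random
import string

def make_patient_rows(n, seed=42):
    """Sketch of the Figure 1 model: Name (random identifier), Age (normal
    distribution), IsAdult (computed from the Age dimension)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Name: a random 8-letter string, carrying no real personal data
        name = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
        # Age: mean 14.0; the 3.0 parameter is taken as the standard deviation here
        age = rng.gauss(14.0, 3.0)
        # IsAdult: the only dimension computed from another dimension
        rows.append({"Name": name, "Age": age, "IsAdult": 1 if age > 18 else 0})
    return rows
```

The IsAdult column illustrates the dependency rule above: it may reference Age, but no pair of dimensions may reference each other cyclically.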
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Modelling of the generator</head><p>Editing a data model includes editing general information about the model, such as the name, description, and availability to other users. Most importantly, in this mode, each dimension can be edited separately, with a real-time check of the correctness of the entered data. Dimensions can be added or deleted, but each model will always have at least one dimension with a line number. Each dimension must be given a unique name that does not match other dimensions within the model. All dimensions must be one of the defined data types (see Table 1). The lists of operands and functions available for generation are described in Tables <ref type="table" target="#tab_2">2-4</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>The list of operands available for generation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Usage scenarios: generating test datasets</head><p>To verify the usefulness of such software, practical tests were performed with the help of the program. A data model was specified in the generator and multiple tweaks were applied to it. The data set was created using the proposed algorithm to generate the Gaussian samples (1)-(<ref type="formula" target="#formula_6">6</ref>). The default model is described as follows: two classes; two dimensions, one for the class and another for the value; the value is generated depending on the class with the help of a ternary operator and a normal distribution; no outliers and no blank values.</p><p>The dataset was specified on the data model tab. For each test, a CSV file with 1000 entries was generated. Then, with the help of Python, graphs for this data were drawn. This provides a visual clue about how each change affects the data. A total of four different tests were performed, each with three different variations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1.">Changing parameters of normal distribution</head><p>In this test, the standard deviation of the normal distribution for both classes was changed. The values chosen are 5.0, 3.0, and 1.0. The expected result of such a change is that values become less dispersed along the axis. The result of the software for this case is presented in Figures <ref type="figure" target="#fig_3">2-3</ref>.</p></div>
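The expected narrowing can also be checked numerically. A minimal sketch, again with hypothetical class means 10 and 20: the overall spread of the value column should shrink monotonically as the standard deviation goes from 5.0 to 3.0 to 1.0.

```python
import random
import statistics

def class_values(n, sigma, seed):
    """Two-class sample: value ~ N(10, sigma) for class1, N(20, sigma)
    for class2 (means are illustrative assumptions)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        cls = rng.choice(["class1", "class2"])
        mean = 10.0 if cls == "class1" else 20.0
        data.append((cls, rng.gauss(mean, sigma)))
    return data

def spread(data):
    """Population standard deviation of the value column (includes the
    between-class separation as well as the per-class sigma)."""
    return statistics.pstdev(v for _, v in data)
```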
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2.">Changing class distribution</head><p>Here, a weighted category was introduced in place of the default category. For class 1, the probability of appearance decreases each time, and for class 2 it increases. The rate of change is 10%. So, the first distribution of classes is 50-50%, then 40-60%, then 30-70% (Figures <ref type="figure" target="#fig_5">4-5</ref>).</p><p>As can be seen in Figure <ref type="figure" target="#fig_5">5</ref>, the number of class2 entries gets higher in each successive picture, confirming that this feature works.</p></div>
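A weighted category of this kind reduces to a single uniform draw. The sketch below (our illustration, not the generator's internal implementation) reproduces the 50-40-30% progression for class1.

```python
import random

def weighted_class(p_class1, rng):
    """Weighted category: 'class1' with probability p_class1, else 'class2'."""
    return "class1" if rng.random() < p_class1 else "class2"

def class1_count(p_class1, n, seed):
    """Count how many of n generated rows fall into class1."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n) if weighted_class(p_class1, rng) == "class1")
```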
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.3.">Adding outliers</head><p>To use outliers, an additional dimension was added. A random value is calculated with the help of a uniform distribution, and if it is less than a certain threshold (which equals the probability of such an event), the value is multiplied by 5 (Figures <ref type="figure" target="#fig_7">6-7</ref>). Otherwise, the expected value is placed. Clearly, with an increasing probability of outliers appearing, the number of outliers grows. It is also worth noting that class2 outliers reach higher values than class1 outliers because of the bigger base value. This creates additional separation between class1 and class2 in the higher values. </p></div>
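The extra outlier dimension amounts to a thresholded uniform draw, sketched here as a standalone helper (the generator expresses the same logic through its expression language):

```python
import random

def with_outlier(expected, p_outlier, rng):
    """If a uniform draw falls below the threshold p_outlier, the expected
    value becomes an outlier (multiplied by 5); otherwise it is kept as-is."""
    return expected * 5 if rng.random() < p_outlier else expected
```

Because the outlier is a multiple of the base value, a class with a bigger base value produces bigger outliers, matching the class2-versus-class1 observation above.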
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.4.">Setting up missing values</head><p>Missing values can be introduced into the dataset in a manner similar to outliers, with the help of an additional dimension. The only change is that the missing() function is used instead of multiplication (Figures <ref type="figure" target="#fig_8">8-9</ref>). </p></div>
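The same pattern with the blank branch swapped in; `None` here stands in for the generator's missing() function (an assumption of this sketch):

```python
import random

def with_missing(expected, p_missing, rng):
    """If a uniform draw falls below the threshold p_missing, the cell is left
    blank (None, standing in for missing()); otherwise the value is kept."""
    return None if rng.random() < p_missing else expected
```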
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Software description 3.4.1. Module description</head><p>The software implementation of this synthetic dataset generator (lexData) consists of three main parts: a tokenizer, a parser, and a calculator. The input of this system is a text expression describing the formula for the dimension value. Figure <ref type="figure" target="#fig_1">10</ref> depicts how the text input can be processed into a result:</p><p>In this simple example, there are multiple steps. In the first step, the input string is split into multiple tokens. A token is an atomic part of the mathematical formula. A single plus sign is a token, but a whole number or a variable name is also a token, since a number cannot simply be divided into digits. After that, the sequence of tokens is handled by the parser, which builds a complete function. After passing concrete values for the parameters (which may be other dimensions) specified in the function, the final value can be calculated and returned as the result. However, there are many more details, such as verifying that a referenced function exists or that there are no circular dependencies. The output of the calculation is a single number or category, depending on the expression specified. </p></div>
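To make the tokenizer-parser-calculator pipeline concrete, here is a minimal Python sketch. lexData itself is written in C# on top of lexCalculator; the grammar below is a simplified stand-in supporting only numbers, dimension references, parentheses, and the four arithmetic operators.

```python
import re

TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z_]\w*|[()+\-*/]")

def tokenize(expr):
    """Step 1: split the input string into atomic tokens
    (numbers, names, operators)."""
    return TOKEN.findall(expr)

def evaluate(tokens, variables):
    """Steps 2-3 collapsed: a tiny recursive-descent parser/evaluator;
    'variables' supplies the values of referenced dimensions."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def atom():
        tok = take()
        if tok == "(":
            val = expr_()
            take()  # consume the closing ")"
            return val
        if re.fullmatch(r"\d+(?:\.\d+)?", tok):
            return float(tok)
        return variables[tok]  # a reference to another dimension

    def term():
        val = atom()
        while peek() in ("*", "/"):
            op, rhs = take(), atom()
            val = val * rhs if op == "*" else val / rhs
        return val

    def expr_():
        val = term()
        while peek() in ("+", "-"):
            op, rhs = take(), term()
            val = val + rhs if op == "+" else val - rhs
        return val

    return expr_()
```

A real implementation additionally checks that referenced names exist and that dimension expressions contain no circular dependencies, as noted above.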
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.2.">Constraints and decisions</head><p>The .NET Framework and the C# programming language were used to implement the generator software. The generator core is implemented on top of a custom-created library for parsing and evaluating math expressions, called lexCalculator. To integrate it with this software, the library was modified to support the following features: generating random distributions; string data types; logical expressions; complex data types and lists.</p><p>With the help of NUnit, unit testing was performed on this software. Most of the possible test cases were checked, and the tests provided more than 90% code coverage.</p><p>The system was tested on the following system specifications: Windows 10 Pro; Intel(R) Core(TM) i5-8350U CPU; 16GB RAM; NVMe SK hynix 256GB SSD. With these specifications, it was determined how the generator performed on different datasets.</p><p>The following table describes how many rows were generated per second on average for the specified dataset. Each dimension is a simple switch between classes with a normal distribution calculation.</p><p>As can be seen in Table <ref type="table">5</ref>, the number of classes does not affect the performance of generation too much, except for the parsing stage, where more classes mean there are more possibilities to handle. As for the dimensions, these directly affect the performance, but that also depends on the expressions specified for these dimensions. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Discussion</head><p>Generators of synthetic datasets represent a potent tool for conducting controlled experiments and investigating the performance of machine learning methods in various scenarios. They facilitate an enhanced understanding of the capabilities and limitations of these methods. For instance, in this study, the discussed synthetic datasets with known characteristics and distributions were utilized for the purpose of conducting controlled tests of logistic regression, as illustrated in Table <ref type="table" target="#tab_3">6</ref>. These benchmark datasets can encompass diverse data complexities, such as class imbalance, missing values, and outliers, thereby aiding in assessing how effectively logistic regression operates under different conditions and whether additional tuning is necessary. To assess the effectiveness of the logistic regression model, the following metrics were employed:</p><p>1. Precision is the ability of the classifier not to label as positive a sample that is negative. In Table 6 this metric provides an assessment of the model's overall precision, considering the weights of each class based on their distribution in the dataset. This allows for accounting for class imbalance, where one class may have significantly more instances than others.</p><p>2. Accuracy. This metric indicates the accuracy of a classification model, measuring the overall percentage of correct predictions (both positive and negative) out of the total number of examples in the dataset:</p><formula xml:id="formula_7">𝑎𝑐𝑐𝑢𝑟𝑎𝑐𝑦(𝑦, 𝑦̂) = (1/𝑛_samples) ∑_{𝑖=0}^{𝑛_samples−1} 1(𝑦̂_𝑖 = 𝑦_𝑖),<label>(7)</label></formula><p>where 𝑦̂_𝑖 is the predicted value of the 𝑖-th sample, 𝑦_𝑖 is the corresponding true value, and 1(𝑥) is the indicator function.</p><p>3. Type 1 Error. This metric indicates the precision of the model for the class denoted as "class1" or the positive class.
Type 1 Error measures the percentage of correct positive predictions made by the model among all positive predictions:</p><formula xml:id="formula_8">𝑇𝑦𝑝𝑒 1 𝐸𝑟𝑟𝑜𝑟 = 𝑡𝑝 𝑡𝑝 + 𝑓𝑝 ,<label>(8)</label></formula><p>where 𝑡𝑝 (true positive) is correct result, 𝑓𝑝 (false positive) is unexpected result.</p><p>4. Type 2 Error. This metric typically represents the proportion of false negative predictions made by a classification model, specifically in the context of binary classification tasks. Type 2 Error quantifies the rate at which the model incorrectly predicts negative outcomes, providing insights into its ability to avoid missing positive cases:</p><formula xml:id="formula_9">𝑇𝑦𝑝𝑒 2 𝐸𝑟𝑟𝑜𝑟 = 1 − 𝑡𝑝 𝑡𝑝 + 𝑓𝑛 ,<label>(9)</label></formula><p>where 𝑡𝑝 (true positive) is correct result, 𝑓𝑛 (false negative) is missing result. 5. F1-Score. This metric is a combination of Precision and Recall and is used to assess the balance between these two metrics. It is particularly useful in situations where there is class imbalance (different numbers of instances for different classes) because it considers both Precision and Recall for each class and calculates their weighted harmonic mean:</p><formula xml:id="formula_10">𝐹1_𝑆𝑐𝑜𝑟𝑒 = 2 * 𝑃(𝑦, 𝑦 ̂) × 𝑅(𝑦, 𝑦 ̂) 𝑃(𝑦, 𝑦 ̂) + 𝑅(𝑦, 𝑦 ̂),<label>(10)</label></formula><p>where 𝑃(𝑦, 𝑦 ̂) is precision, 𝑅(𝑦, 𝑦 ̂) is recall.</p><p>Leveraging synthetic dataset generators enables rapid iteration and testing of various hypotheses and model parameters without the need to wait for real data. </p></div>
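The five metrics above can all be computed from confusion-matrix counts. A minimal sketch (Python, standard library only; the function names are illustrative) follows the paper's own definitions, where "Type 1 Error" is positive-class precision as in formula (8) and "Type 2 Error" is 1 minus recall as in formula (9):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and false negatives for a binary task."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

def metrics(y_true, y_pred, positive=1):
    """Metrics as defined in formulas (7)-(10); names follow the paper."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)                  # formula (7)
    type1_error = tp / (tp + fp)                        # formula (8): precision
    recall = tp / (tp + fn)
    type2_error = 1 - recall                            # formula (9)
    f1 = 2 * type1_error * recall / (type1_error + recall)  # formula (10)
    return {"accuracy": accuracy, "type1_error": type1_error,
            "type2_error": type2_error, "f1": f1}
```

For production use, the equivalent scikit-learn functions (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`) handle multi-class weighting and zero-division edge cases.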
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>The article explores the construction of Gaussian distributions using the Box-Muller transform, a method that relies on uniform random numbers to generate pairs of independent Gaussian variables. While efficient for the core of the Gaussian distribution, the Box-Muller algorithm may fall short in generating the extreme or outlier values crucial for evaluating rare events. To address this limitation, the article proposes incorporating specialized distributions to generate extreme values in the tails. By combining these extreme values with standard normal random variables, a broader dataset can be formed, enriching evaluations and bolstering machine learning models.</p><p>This study also introduces a synthetic data generator designed for evaluating data visualization methods and machine learning systems. The application is highly adaptable, allowing users to create and store models, generate artificial data, and explore models created by other users. This accessibility promotes collaborative learning and testing of machine learning models. The article also demonstrates the practical utility of the application for assessing machine learning algorithms: it illustrates the process of generating various datasets, enabling precise control over typical challenges encountered in machine learning tasks. The visual representations created for each scenario provide compelling evidence of the tool's reliability across diverse situations.</p><p>Future research may explore employing machine learning techniques to enhance the realism of synthetic data by introducing common noise patterns observed in real-world data while preserving the underlying distribution. Another avenue of investigation involves the generation of synthetic data encompassing categorical, time-series, or mixed data types. 
This would enable the utilization of the generated synthetic data within the context of the Computing with Words Model <ref type="bibr" target="#b24">[25,</ref><ref type="bibr" target="#b25">26]</ref> and other fuzzy set models.</p><p>In conclusion, the creation of benchmark datasets for testing data analytics tools is a crucial step in the advancement of data science and machine learning research, offering a standardized means of evaluating and comparing the performance of various analytical methodologies. These benchmark datasets not only facilitate fair and rigorous assessment of data analytics tools but also open avenues for future research into the refinement of synthetic data generation techniques and the development of more comprehensive and realistic benchmark datasets.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>7 .</head><label>7</label><figDesc>After obtaining the standard normal random variables and extreme values z := concatenate(x, y, t), one can move on to a normally distributed value with mathematical expectation μ and standard deviation σ: ξ = μ + σz.</figDesc></figure>
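The tail-augmentation scheme summarized above (and in the caption of step 7) can be sketched as follows; the Box-Muller transform is the paper's method, while the exponential tail distribution and the `tail_scale` offset below are illustrative assumptions, not the paper's exact construction:

```python
import math
import random

def box_muller(n):
    """Generate n standard normal samples via the Box-Muller transform."""
    out = []
    while len(out) < n:
        u1 = 1.0 - random.random()   # keep u1 in (0, 1] to avoid log(0)
        u2 = random.random()
        r = math.sqrt(-2.0 * math.log(u1))
        out.append(r * math.cos(2.0 * math.pi * u2))
        out.append(r * math.sin(2.0 * math.pi * u2))
    return out[:n]

def with_tails(n_core, n_tail, mu=0.0, sigma=1.0, tail_scale=4.0):
    """Concatenate core Box-Muller samples with extreme tail values, then
    rescale: z := concatenate(core, tails), xi = mu + sigma * z."""
    core = box_muller(n_core)
    # tail values pushed past the core range, with random sign
    tails = [(tail_scale + random.expovariate(1.0)) * random.choice((-1, 1))
             for _ in range(n_tail)]
    z = core + tails
    return [mu + sigma * v for v in z]
```

Drawing the tails from a separate distribution guarantees that rare, extreme observations appear in the dataset at a controlled rate instead of waiting for the Gaussian tail probability.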
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example of data model</figDesc><graphic coords="5,72.00,357.86,451.00,132.30" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Data model options with standard deviation changes</figDesc><graphic coords="7,72.00,218.28,451.00,98.30" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Resulting data graphs with standard deviation changes (original.csv, std3.csv, std1.csv) For the current test, a swarm plot was used. The color represents the class of the item, and its position on the Y axis represents the value. With each subsequent plot the values become more tightly packed around the central value, confirming that the standard deviation is indeed changing.</figDesc><graphic coords="7,72.00,340.71,451.00,226.90" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Data model options with class distribution changes</figDesc><graphic coords="8,72.00,72.00,451.00,95.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Resulting data graphs with class distribution changes (original.csv, 40-60.csv, 30-70.csv)</figDesc><graphic coords="8,72.00,194.08,451.00,227.85" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Data model options with added outliers</figDesc><graphic coords="8,72.00,547.84,451.00,128.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Resulting data graphs with outliers (outliers5.csv, outliers10.csv, outliers20.csv)</figDesc><graphic coords="9,72.00,72.00,451.00,192.65" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Data model options with blank values For this test, an event plot was used instead of a swarm plot. Colored lines represent generated objects (from 1 to 1000) with their corresponding class. Each black line represents a missing value.</figDesc><graphic coords="9,72.00,374.21,451.00,129.30" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 9 :Figure 10 :</head><label>910</label><figDesc>Figure 9: Resulting data graphs with blank values (original.csv, blank10.csv, blank20.csv)</figDesc><graphic coords="10,72.00,72.00,451.00,241.65" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Supported data types</figDesc><table><row><cell>Data type</cell><cell>Description</cell><cell>Examples</cell></row><row><cell>String</cell><cell>A sequence of symbols with variable length</cell><cell>"Oleksii", "Kohut", "test123"</cell></row><row><cell>Integer</cell><cell>x ∈ ℤ</cell><cell>-2, -1, 0, 1, 2...</cell></row><row><cell>Real</cell><cell>x ∈ ℝ</cell><cell>0.5, 1.3e22, 3.14, -0.001</cell></row><row><cell>Category</cell><cell>Custom data type: choose one item from a given list</cell><cell>(1, 2, 3), ("apple", "banana", "orange")</cell></row><row><cell>Boolean</cell><cell>Logical data type (example usage of the category type)</cell><cell>(0, 1), ("true", "false")</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Performance testing of the software</figDesc><table><row><cell>Dataset description</cell><cell>Rows/s</cell></row><row><cell>2 classes, 2 dimensions</cell><cell>16534</cell></row><row><cell>2 classes, 4 dimensions</cell><cell>10233</cell></row><row><cell>2 classes, 8 dimensions</cell><cell>7610</cell></row><row><cell>3 classes, 2 dimensions</cell><cell>15669</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 6</head><label>6</label><figDesc>Classification comparison</figDesc><table><row><cell>Dataset</cell><cell>Precision</cell><cell>Accuracy</cell><cell>Type 1 Error</cell><cell>Type 2 Error</cell><cell>F1-Score</cell></row><row><cell>original.csv</cell><cell>0.860633</cell><cell>0.860000</cell><cell>0.844660</cell><cell>0.876289</cell><cell>0.859972</cell></row><row><cell>std1.csv</cell><cell>1.000000</cell><cell>1.000000</cell><cell>1.000000</cell><cell>1.000000</cell><cell>1.000000</cell></row><row><cell>std3.csv</cell><cell>0.955073</cell><cell>0.955000</cell><cell>0.961905</cell><cell>0.947368</cell><cell>0.955012</cell></row><row><cell>40-60.csv</cell><cell>0.865830</cell><cell>0.865000</cell><cell>0.876712</cell><cell>0.858268</cell><cell>0.863560</cell></row><row><cell>30-70.csv</cell><cell>0.864803</cell><cell>0.865000</cell><cell>0.862745</cell><cell>0.865772</cell><cell>0.860449</cell></row><row><cell>blank10.csv</cell><cell>0.856354</cell><cell>0.856354</cell><cell>0.858696</cell><cell>0.853933</cell><cell>0.856354</cell></row><row><cell>blank20.csv</cell><cell>0.885321</cell><cell>0.885350</cell><cell>0.884058</cell><cell>0.886364</cell><cell>0.885190</cell></row><row><cell>outliers5.csv</cell><cell>0.865167</cell><cell>0.865000</cell><cell>0.870968</cell><cell>0.859813</cell><cell>0.864888</cell></row><row><cell>outliers10.csv</cell><cell>0.860000</cell><cell>0.860000</cell><cell>0.871560</cell><cell>0.846154</cell><cell>0.860000</cell></row><row><cell>outliers20.csv</cell><cell>0.868215</cell><cell>0.865000</cell><cell>0.831776</cell><cell>0.903226</cell><cell>0.864848</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Scientific machine learning benchmarks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Thiyagalingam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shankar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Fox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Hey</surname></persName>
		</author>
		<idno type="DOI">10.1038/s42254-022-00441-7</idno>
		<ptr target="https://doi.org/10.1038/s42254-022-00441-7" />
	</analytic>
	<monogr>
		<title level="j">Nature Reviews Physics</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="413" to="420" />
			<date type="published" when="2022-04">Apr. 2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">ImageNet classification with deep convolutional neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E</forename><surname>Hinton</surname></persName>
		</author>
		<idno type="DOI">10.1145/3065386</idno>
		<ptr target="https://doi.org/10.1145/3065386" />
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">60</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="84" to="90" />
			<date type="published" when="2012-05">May 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Do we need hundreds of classifiers to solve real world classification problems</title>
		<author>
			<persName><forename type="first">M</forename><surname>Fernández-Delgado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Cernadas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Barro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amorim</surname></persName>
		</author>
		<idno type="DOI">10.5555/2627435.2697065</idno>
		<ptr target="https://doi.org/10.5555/2627435.2697065" />
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<date type="published" when="2014-01">Jan. 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Secure and Robust Machine Learning for Healthcare: A Survey</title>
		<author>
			<persName><forename type="first">A</forename><surname>Qayyum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qadir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bilal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Al-Fuqaha</surname></persName>
		</author>
		<idno type="DOI">10.1109/rbme.2020.3013489</idno>
		<ptr target="https://doi.org/10.1109/rbme.2020.3013489" />
	</analytic>
	<monogr>
		<title level="j">IEEE Reviews in Biomedical Engineering</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="156" to="180" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Synthetic Datasets Generator for Testing Information Visualization and Machine Learning Techniques and Tools</title>
		<author>
			<persName><forename type="first">P</forename><surname>Mendonca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gustavo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Santos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Araujo</surname></persName>
		</author>
		<idno type="DOI">10.1109/access.2020.2991949</idno>
		<ptr target="https://doi.org/10.1109/access.2020.2991949" />
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="82917" to="82928" />
			<date type="published" when="2020-01">Jan. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Generation of synthetic data sets for evaluating the accuracy of knowledge discovery systems</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Jeske</surname></persName>
		</author>
		<idno type="DOI">10.1145/1081870.1081969</idno>
		<ptr target="https://doi.org/10.1145/1081870.1081969" />
	</analytic>
	<monogr>
		<title level="j">Knowledge Discovery and Data Mining</title>
		<imprint>
			<date type="published" when="2005-08">Aug. 2005</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Development of a Synthetic Data Set Generator for Building and Testing Information Discovery Systems</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://ieeexplore.ieee.org/abstract/document/1611688" />
	</analytic>
	<monogr>
		<title level="j">IEEE Xplore</title>
		<imprint>
			<date type="published" when="2006-01">Apr. 01, 2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Data generators: a short survey of techniques and use cases with focus on testing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Popic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Pavkovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Velikic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Teslic</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICCE-BERLIN47944.2019.8966202</idno>
		<ptr target="https://doi.org/10.1109/ICCE-BERLIN47944.2019.8966202" />
	</analytic>
	<monogr>
		<title level="m">IEEE 9th International Conference on Consumer Electronics</title>
				<meeting><address><addrLine>Berlin</addrLine></address></meeting>
		<imprint>
			<publisher>ICCE</publisher>
			<date type="published" when="2019">2019. 2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">K</forename><surname>Dankar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ibrahim</surname></persName>
		</author>
		<idno type="DOI">10.3390/app11052158</idno>
		<ptr target="https://doi.org/10.3390/app11052158" />
	</analytic>
	<monogr>
		<title level="j">Applied Sciences</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page">2158</biblScope>
			<date type="published" when="2021-02">Feb. 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A Multi-Dimensional Evaluation of Synthetic Data Generators</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">K</forename><surname>Dankar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Ibrahim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Ismail</surname></persName>
		</author>
		<idno type="DOI">10.1109/access.2022.3144765</idno>
		<ptr target="https://doi.org/10.1109/access.2022.3144765" />
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="11147" to="11158" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Multidimensional Data Structures and Techniques for Efficient Decision Making</title>
		<author>
			<persName><forename type="first">Mugurel</forename><forename type="middle">Ionut</forename><surname>Andreica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Madalina</forename><surname>Andreica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nicolae</forename><surname>Cataniciu</surname></persName>
		</author>
		<idno>⟨hal-00467676⟩</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th WSEAS International Conference on Mathematics and Computers in Business and Economics (MCBE)</title>
				<meeting>the 10th WSEAS International Conference on Mathematics and Computers in Business and Economics (MCBE)<address><addrLine>Prague, Czech Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2009-03">Mar 2009</date>
			<biblScope unit="page" from="249" to="254" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Synthetic Data Generation using Benerator Tool</title>
		<author>
			<persName><forename type="first">V</forename><surname>Ayala-Rivera</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mcdonagh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Cerqueus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Murphy</surname></persName>
		</author>
		<ptr target="https://www.researchgate.net/publication/258125711_Synthetic_Data_Generation_using_Benera" />
		<imprint>
			<date type="published" when="2013-10">Oct. 2013</date>
		</imprint>
		<respStmt>
			<orgName>Cornell University</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">arXiv</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">A Novel Approach for Generating Synthetic Datasets for Digital Forensics</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">W</forename><surname>Göbel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schäfer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julien</forename><surname>Hachenberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Türr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Baier</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-56223-6_5</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-56223-6_5" />
		<imprint>
			<date type="published" when="2020-01">Jan. 2020</date>
			<biblScope unit="page" from="73" to="93" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Generating Synthetic Datasets for Experimental Validation of Fraud Detection</title>
		<author>
			<persName><forename type="first">I</forename><surname>Ul Haq</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gondal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Vamplew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Layton</surname></persName>
		</author>
		<ptr target="https://www.researchgate.net/publication/316878436_Generating_Synthetic_Datasets_for_Experimental_Validation_of_Fraud_Detection" />
	</analytic>
	<monogr>
		<title level="m">Fourteenth Australasian Data Mining Conference</title>
		<title level="s">Conferences in Research and Practice in Information Technology</title>
		<meeting><address><addrLine>Canberra, Australia</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">170</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A method for machine learning generation of realistic synthetic datasets for validating healthcare applications</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">N</forename><surname>Arvanitis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>White</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Harrison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Chaplin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Despotou</surname></persName>
		</author>
		<idno type="DOI">10.1177/14604582221077000</idno>
		<ptr target="https://doi.org/10.1177/14604582221077000" />
	</analytic>
	<monogr>
		<title level="j">Health Informatics Journal</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page">146045822210770</biblScope>
			<date type="published" when="2022-01">Jan. 2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Research Paper: Process Mining and Synthetic Health Data: Reflections and Lessons Learnt</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bullward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aljebreen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Coles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Mcinerney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Johnson</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-27815-0_25</idno>
		<ptr target="https://doi.org/10.1007/978-3-031-27815-0_25" />
	</analytic>
	<monogr>
		<title level="m">Process Mining Workshops. ICPM 2022</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Montali</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Senderovich</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Weidlich</surname></persName>
		</editor>
		<meeting>Process Mining Workshops. ICPM 2022<address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">468</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Creating Synthetic Training Data for Machine Vision Quality Gates</title>
		<author>
			<persName><forename type="first">I</forename><surname>Gräßler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hieb</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roesmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Unverzagt</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-662-66769-9_7</idno>
		<ptr target="https://doi.org/10.1007/978-3-662-66769-9_7" />
	</analytic>
	<monogr>
		<title level="m">Bildverarbeitung in der Automation. Technologien für die intelligente Automation</title>
				<editor>
			<persName><forename type="first">V</forename><surname>Lohweg</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Vieweg</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">17</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Generating real-world-like labelled synthetic datasets for construction site applications</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Barrera-Animas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Davila Delgado</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.autcon.2023.104850</idno>
		<ptr target="https://doi.org/10.1016/j.autcon.2023.104850" />
	</analytic>
	<monogr>
		<title level="j">Automation in Construction</title>
		<imprint>
			<biblScope unit="volume">151</biblScope>
			<biblScope unit="page">104850</biblScope>
			<date type="published" when="2023-07">Jul. 2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Synthetic datasets for Deep Learning in computer-vision assisted tasks in manufacturing</title>
		<author>
			<persName><forename type="first">C</forename><surname>Manettas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Nikolakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Alexopoulos</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.procir.2021.10.038</idno>
		<ptr target="https://doi.org/10.1016/j.procir.2021.10.038" />
	</analytic>
	<monogr>
		<title level="j">Procedia CIRP</title>
		<imprint>
			<biblScope unit="volume">103</biblScope>
			<biblScope unit="page" from="237" to="242" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Generation of Synthetic AI Training Data for Robotic Grasp-Candidate Identification and Evaluation in Intralogistics Bin-Picking Scenarios</title>
		<author>
			<persName><forename type="first">D</forename><surname>Holst</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schoepflin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Schüppstuhl</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-031-18326-3_28</idno>
		<ptr target="https://doi.org/10.1007/978-3-031-18326-3_28" />
	</analytic>
	<monogr>
		<title level="m">Flexible Automation and Intelligent Manufacturing: The Human-Data-Technology Nexus . FAIM 2022</title>
		<title level="s">Lecture Notes in Mechanical Engineering</title>
		<editor>
			<persName><forename type="first">K</forename><forename type="middle">Y</forename><surname>Kim</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Monplaisir</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Rickli</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Generative adversarial network based synthetic data training model for lightweight convolutional neural networks</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">H</forename><surname>Rather</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
		<idno type="DOI">10.1007/s11042-023-15747-6</idno>
		<ptr target="https://doi.org/10.1007/s11042-023-15747-6" />
	</analytic>
	<monogr>
		<title level="j">Multimed Tools Appl</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">A Note on the Generation of Random Normal Deviates</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">E P</forename><surname>Box</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Muller</surname></persName>
		</author>
		<idno type="DOI">10.1214/aoms/1177706645</idno>
		<ptr target="https://doi.org/10.1214/aoms/1177706645" />
	</analytic>
	<monogr>
		<title level="j">The Annals of Mathematical Statistics</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="610" to="611" />
			<date type="published" when="1958-06">Jun. 1958</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Gaussian random number generators</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">B</forename><surname>Thomas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Luk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">H W</forename><surname>Leong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Villasenor</surname></persName>
		</author>
		<idno type="DOI">10.1145/1287620.1287622</idno>
		<ptr target="https://doi.org/10.1145/1287620.1287622" />
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="issue">4</biblScope>
			<date type="published" when="2007-11">Nov. 2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Extreme Value Distributions: An Overview of Estimation and Simulation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Albashir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ibrahim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noratiqah</forename><surname>Mohd Ariff</surname></persName>
		</author>
		<idno type="DOI">10.1155/2022/5449751</idno>
		<ptr target="https://doi.org/10.1155/2022/5449751" />
	</analytic>
	<monogr>
		<title level="j">Journal of Probability and Statistics</title>
		<imprint>
			<biblScope unit="volume">2022</biblScope>
			<biblScope unit="page" from="1" to="17" />
			<date type="published" when="2022-10">Oct. 2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Forecasting of Categorical Time Series Using Computing with Words Model</title>
		<author>
			<persName><forename type="first">O</forename><surname>Tymchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pylypenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Iepik</surname></persName>
		</author>
		<ptr target="https://ceur-ws.org/Vol-3384/Short_2.pdf" />
	</analytic>
	<monogr>
		<title level="m">Selected Papers of the IX International Scientific Conference &apos;Information Technology and Implementation&apos; (IT&amp;I-2022), Workshop Proceedings</title>
		<meeting><address><addrLine>Kyiv, Ukraine</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022-12-02">November 30 - December 2, 2022</date>
			<biblScope unit="volume">3384</biblScope>
			<biblScope unit="page" from="151" to="159" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Input Data Clustering for the Efficient Operation of Renewable Energy Sources in a Distributed Information System</title>
		<author>
			<persName><forename type="first">N</forename><surname>Kiktev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Osypenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shkurpela</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Balaniuk</surname></persName>
		</author>
		<idno type="DOI">10.1109/CSIT49958.2020.9321940</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 15th International Conference on Computer Sciences and Information Technologies (CSIT)</title>
		<meeting><address><addrLine>Zbarazh, Ukraine</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="9" to="12" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
