<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PuckTrick: A Library for Making Synthetic Data More Realistic</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandra Agostini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Maurino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Blerina Spahiu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Department of Informatics, Systems and Communication, University of Milano-Bicocca, Viale Sarca 336</institution>
          ,
          <addr-line>20126 Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>The increasing reliance on machine learning (ML) models for decision-making requires high-quality training data. However, access to real-world datasets is often restricted due to privacy concerns, proprietary restrictions, and incomplete data availability. As a result, synthetic data generation (SDG) has emerged as a viable alternative, enabling the creation of artificial datasets that preserve the statistical properties of real data while ensuring privacy compliance. Despite its advantages, synthetic data is often overly clean and lacks real-world imperfections, such as missing values, noise, outliers, and misclassified labels, which can significantly impact model generalization and robustness. To address this limitation, we introduce Pucktrick, a Python library designed to systematically contaminate synthetic datasets by introducing controlled errors. The library supports multiple error types, including missing data, noisy values, outliers, label misclassification, duplication, and class imbalance, offering a structured approach to evaluating ML model resilience under real-world data imperfections. Pucktrick provides two contamination modes: one for injecting errors into clean datasets and another for further corrupting already contaminated datasets. Through extensive experiments on real-world financial datasets, we evaluate the impact of systematic data contamination on model performance. Our findings demonstrate that ML models trained on contaminated synthetic data outperform those trained on purely synthetic, error-free data, particularly for tree-based and linear models such as SVMs and Extra Trees.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rapid advancement of technology in recent years has facilitated the efficient and large-scale
gathering, processing, and analysis of data for informed decision-making across various domains.
As of 2024, approximately 402.74 million terabytes (or 0.4 zettabytes) of data are created each
day, totaling around 147 zettabytes annually. This daily data generation is projected to increase
to about 181 zettabytes per year by 2025. The rapid evolution of statistical analysis and pattern
recognition techniques has revolutionized the ability to extract meaningful insights from such
data. However, the effectiveness of these methods depends on the quality of the data being
analyzed. When data is incomplete, inconsistent, duplicated, or lacks proper security measures,
the accuracy and reliability of the results are significantly compromised.</p>
      <p>
        One of the major challenges in machine learning is the availability of real datasets for training
models [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In many applied contexts, it is not always possible to use real data due to
privacy concerns or the unwillingness of data owners to share their proprietary information.
To address both data quality and privacy concerns, synthetic data has emerged as a trade-off
solution [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Synthetic data replicates the statistical properties of real data while eliminating
direct exposure of sensitive information. It also helps mitigate the issue of missing values
by generating artificial replacements, balancing datasets by creating synthetic examples for
underrepresented groups, and providing an added layer of security since the generated data does
not correspond to real-world individuals or entities. Existing models focus on creating synthetic
data by leveraging techniques such as generative adversarial networks (GANs) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], variational
autoencoders (VAEs) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and statistical sampling methods [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to replicate the properties of
real-world data distributions.
      </p>
      <p>
        Once a synthetic dataset is generated, its quality is typically assessed by training a preliminary
machine learning model and evaluating performance metrics such as accuracy, F1-score,
precision, and recall. These metrics help determine whether the synthetic data preserves meaningful
patterns from real-world data. By performing this initial assessment with a lightweight model
or proxy task, researchers can estimate the dataset’s utility before committing to the expensive
and time-consuming process of full-scale model training [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, this approach overlooks
a critical aspect: real-world data is inherently noisy and imperfect. For example, a medical
sensor may malfunction and provide incorrect measurements. A change in the IT system that
feeds the dataset could result in missing data or values on a different scale than the original ones.
Finally, experts might misclassify certain data points. Synthetic datasets, typically designed to
be clean and well-structured, may fail to capture these real-world inconsistencies. As a result,
models trained on idealized synthetic data risk poor generalization when applied to real-world
scenarios, where unpredictable data variations and errors are common [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This gap highlights
the importance of incorporating realistic noise and imperfections into synthetic data to improve
its applicability in practical machine-learning applications.
      </p>
      <p>To address this limitation, we developed a Python library called Pucktrick, designed to
systematically and controllably contaminate datasets, thereby generating more realistic training data.
The Pucktrick library simulates errors that can be encountered in a dataset used for training
models. The most common types of errors encountered in real-world datasets for machine
learning training include:
• Missing Data: Instances where values are absent, leading to biases and potential
inaccuracies in predictions.
• Noisy Data: Random errors or fluctuations in the data that obscure underlying patterns
and degrade model performance.
• Outliers: Data points that significantly deviate from the norm, potentially skewing model
predictions.
• Imbalanced Data: Unequal representation of classes, leading to bias in favor of the majority
class.
• Duplicate Data: Repeated or nearly identical entries that can lead to overfitting and
inefficient training.
• Incorrect Labels: Mislabelled instances that negatively impact classification performance
and model reliability.</p>
      <p>The Pucktrick library offers two modes of operation. The first mode introduces errors into a
dataset that is initially considered clean, allowing users to simulate real-world imperfections
systematically. The second mode is designed for cases where a dataset is already contaminated
but requires further corruption with specific types of errors. In such situations, the extended
methods of Pucktrick can be used to incrementally introduce additional errors until the desired
error percentage is reached. Beyond dataset contamination, Pucktrick also serves other purposes.
For instance, researchers developing data-cleaning techniques can leverage Pucktrick to inject
specific errors into a dataset and then assess whether their method effectively detects and
corrects these errors. This makes Pucktrick a versatile tool for both generating realistic training
data and benchmarking data-cleaning algorithms.</p>
      <p>Experiments show that training machine learning models on synthetic data with controlled
errors (introduced using Pucktrick) results in better accuracy compared to training on purely
synthetic, error-free data. This confirms that exposing models to realistic imperfections during
training makes them more robust and adaptable, ultimately improving their performance in
real-world scenarios.</p>
      <p>Given the importance of training machine learning models with realistic, imperfect data, in
this study, we investigate the impact of controlled dataset contamination on machine learning
model performance by analyzing:
• The effect of training on contaminated synthetic data versus purely synthetic, error-free
data,
• The influence of different types of data errors on model generalization and real-world
applicability,
• The extent to which systematic error introduction enhances a model’s ability to handle
naturally occurring data imperfections, and
• The potential of Pucktrick as a benchmarking tool for evaluating data-cleaning techniques
and error-handling strategies in machine learning workflows.</p>
      <p>The paper is organized as follows: section 2 reviews the state of the art in both synthetic
data generation and data contamination. Section 3 introduces the Pucktrick library and its
functionalities while section 4 outlines the experimental setup and discusses the preliminary
results. Finally, section 5 provides the conclusions and explores potential directions for future
work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>This section explores the current state of the art in SDG techniques, approaches for introducing
controlled data imperfections, and their implications for model robustness and real-world
applicability.</p>
      <sec id="sec-2-1">
        <title>2.1. Synthetic data generation algorithms</title>
        <p>
          Synthetic data generation (SDG) involves creating artificial datasets that replicate the statistical
properties of real-world data. SDG is relevant in various scenarios where real data cannot be
used for training machine learning models. In the medical domain, for instance, utilizing real
patient data is challenging due to confidentiality concerns. Similarly, in the financial sector,
datasets for fraud detection are often highly imbalanced, with a predominance of non-fraudulent
transactions. Synthetic data is obtained from a generative process based on properties of real
data. A comprehensive survey [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] analyzed 417 SDG models developed over the past decade,
highlighting the evolution of model performance and complexity. The study found that neural
network-based approaches, particularly Generative Adversarial Networks (GANs), dominate the
landscape, especially in computer vision applications. Emerging models like diffusion models
and transformers are also gaining traction, offering promising avenues for future research.
        </p>
        <p>
          The most common format for synthetic data generation is the tabular structure, which consists
of columns (also referred to as features) and rows (also called observations). According to [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] a
synthetic data generation process can be described along four different dimensions: Architecture
(it represents the type of data augmentation technique used), Application level (it refers to the
phase of machine learning pipeline where the process is included), Scope (it is related to the
usage of existing dataset properties), and Data Space (it is related to the representation model
used in the process). The primary metric for assessing the quality of synthetic data is its ability
to enhance machine learning model performance, typically measured through accuracy or
F1-score. Evaluating synthetic data quality before full-scale model training is crucial, as training
an ML classifier can be computationally expensive and time-consuming. By assessing data
quality early, researchers can optimize resources and improve model efficiency.
        </p>
        <p>
          Generative Adversarial Networks (GANs) have been extensively explored for generating
synthetic tabular data. These models consist of a generator and a discriminator that engage
in a minimax game, leading to the production of data that closely resembles real datasets.
Recent studies have adapted GANs to handle the unique challenges of tabular data, such as
mixed data types and complex feature dependencies. Among other frequently used synthetic
data generation algorithms are CTGAN [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], GReaT [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], SDV [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Conditional Tabular GAN
(CTGAN) is a deep generative model specifically designed for the synthesis of tabular data.
Unlike traditional GANs, which struggle with the complex statistical properties of tabular
datasets, CTGAN introduces several innovations to enhance data fidelity. To address the issue
of managing continuous and categorical features at the same time, CTGAN employs a mode-specific
normalization technique based on a Variational Gaussian Mixture Model (VGM) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
which effectively encodes continuous variables while preserving their original distribution.
CTGAN also mitigates the underrepresented class problem by incorporating a conditional
generator, ensuring that all categories are sufficiently represented during training. Furthermore,
the model employs a training-by-sampling approach, where data instances are selected based
on the log-frequency of categorical values rather than uniformly. This strategy improves the
generator’s ability to produce balanced and representative synthetic samples. The network
architecture of CTGAN consists of fully connected neural networks for both the generator and
the critic. To stabilize training and improve the quality of the generated data, the model utilizes
the Wasserstein GAN with Gradient Penalty (WGAN-GP) framework [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ]. The generator
takes as input a random noise vector along with a conditional variable and produces synthetic
tabular records. The critic then evaluates these records, helping the model refine its ability to
generate realistic samples.
        </p>
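        <p>As an illustration of the training-by-sampling idea, the following hypothetical Python sketch (not CTGAN’s actual implementation) draws categorical values with probability proportional to the logarithm of their frequency, so rare categories appear more often than under uniform row sampling:</p>
        <preformat>
```python
import math
import random
from collections import Counter

def sample_by_log_frequency(column, k, rng=None):
    # Weight each category by log(count + 1); the "+ 1" is an assumption
    # made here so that singleton categories keep a non-zero weight.
    rng = rng or random.Random(0)
    counts = Counter(column)
    categories = list(counts)
    weights = [math.log(counts[c] + 1) for c in categories]
    return rng.choices(categories, weights=weights, k=k)
```
        </preformat>
        <p>Under uniform row sampling, a category present in 10% of rows would appear in roughly 10% of draws; log-frequency weighting raises that share substantially, which is why it helps underrepresented categories during training.</p>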
        <p>
          Variational Autoencoders (VAEs) have also been applied to tabular data synthesis, focusing on
learning latent representations that capture the underlying data distribution. By sampling from
the latent space, VAEs can generate new data points that maintain the statistical properties of
the original dataset. Additionally, models like TVAE (Tabular Variational Autoencoder) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] have
been developed to specifically address the challenges of tabular data, incorporating mechanisms
to handle diverse data types and complex relationships between features.
        </p>
        <p>
          The role of data-centric AI in improving synthetic tabular data generation and evaluation is
investigated in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. Traditional approaches rely on statistical fidelity metrics to assess synthetic
data quality, but the authors argue that these methods alone are insuficient. Instead, they
propose incorporating data profiling techniques that categorize samples based on learning dificulty
to guide synthetic data generation. Through extensive benchmarking across eleven datasets
and five state-of-the-art synthetic data generators, the study demonstrates that considering
data profiles enhances model performance, model selection reliability, and feature selection
efectiveness. The findings suggest that diferent generative models excel in diferent tasks,
emphasizing the need for task-specific synthetic data evaluation.
        </p>
        <p>
          Emerging approaches have explored the use of language models for generating synthetic
tabular data. By treating rows as sequences, these models can capture dependencies between
features effectively. The GReaT model [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] utilizes large language models (LLMs) for synthetic
tabular data generation. Instead of encoding tabular data numerically, GReaT transforms each
row into a structured textual representation using a subject-predicate-object format, maintaining
semantic coherence. A pretrained transformer-based LLM is then fine-tuned on this transformed
dataset, enabling it to generate synthetic tabular samples while preserving statistical properties.
To ensure flexibility, a random feature order permutation step is introduced, preventing the
model from learning an artificial ordering among features. The model supports arbitrary
conditioning, allowing data generation based on selected feature constraints. During inference,
GReaT generates synthetic records by iteratively sampling feature values in an autoregressive
manner. The generated data is then transformed back into tabular form through pattern-matching
algorithms. A Python library is also available (https://github.com/kathrinse/be_great).
        </p>
        <p>
          The development of automated platforms for synthetic data generation has streamlined
the process of creating high-quality synthetic datasets. For example, the Synthetic Tabular
Neural Generator (STNG) [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] integrates multiple generation methods with an AutoML module,
facilitating the automatic generation of synthetic data tailored to specific tasks. This platform
addresses the need for user-friendly tools that can adapt to various data characteristics and
requirements.
        </p>
        <p>
          The Synthetic Data Vault (SDV) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] is a synthetic data generation model designed to
automatically generate synthetic data, enabling data science applications while preserving privacy.
SDV builds generative models of relational databases, allowing the synthesis of artificial data
that retains the statistical and structural properties of real datasets. The system employs a
recursive modeling technique called Recursive Conditional Parameter Aggregation (RCPA),
which models the relationships between database tables to generate realistic synthetic data. SDV
has been evaluated on multiple publicly available datasets, demonstrating that synthetic data
can effectively replace real data in predictive modeling tasks. Its generative process incorporates
multivariate statistical techniques, including Gaussian Copulas, to model data distributions
accurately. SDV is now implemented in a Python library (https://github.com/sdv-dev/SDV) and includes other
synthetic data generation models such as CTGAN and Copulas, allowing users to choose the
most appropriate approach for their specific data characteristics.
        </p>
        <p>Assessing the quality and utility of synthetic tabular data is crucial for its adoption in practical
applications [18]. Structured evaluation frameworks have been proposed to provide a
comprehensive assessment of synthetic data generators. These frameworks consider various metrics,
including resemblance, utility, and privacy preservation, offering a standardized approach to
evaluate and compare different synthetic data generation methods.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data contamination</title>
        <p>
          To the best of the authors’ knowledge, no study has attempted to define a systematic approach
for introducing errors into a dataset to realistically simulate real-world data. In some papers,
errors are introduced only to test the proposed solution. For example, in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] authors introduced two types
of noise: table and key noise. The table noise method alters the covariance structure between
variables in the dataset. To introduce noise, the SDV modifies the covariance values   (where
 ̸= ) by halving them, efectively reducing the strength of correlations between features. This
perturbation ensures that the synthetic data maintains its general statistical properties but
with weaker dependencies, making it more distinct from the original dataset. The key noise
method afects the integrity of foreign key relationships in relational datasets. Instead of using
the inferred relationships when synthesizing child tables, the SDV randomly assigns foreign
keys to synthetic records. This disrupts the original structural dependencies within the dataset,
introducing additional randomness while preserving the overall schema. Such methods are only
used to test the quality of the proposed synthetic data generation algorithms and there are no
available noise functions in the python library.
        </p>
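        <p>A minimal sketch of the table-noise idea (our own illustration, not SDV’s actual code) is to halve every off-diagonal entry of a covariance matrix while keeping the variances intact:</p>
        <preformat>
```python
def halve_off_diagonal(cov):
    # Keep variances (diagonal entries) intact and halve covariances
    # (off-diagonal entries), weakening feature correlations as in the
    # table-noise perturbation described above.
    n = len(cov)
    return [[cov[i][j] if i == j else cov[i][j] / 2.0 for j in range(n)]
            for i in range(n)]
```
        </preformat>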
        <p>Numerous studies have investigated the impact of the most common types of data errors on
machine learning models, and an equally large number of approaches have been proposed to
detect and correct these errors. In [19], authors address the problem of binary classification in
the presence of random label noise, where training labels may be independently flipped with a
certain probability. The authors propose two approaches to adapt surrogate loss functions for
robust learning. These methods lead to a key result: widely used techniques such as weighted
SVMs and weighted logistic regression are provably noise-tolerant. Authors of [20] discuss a
survey of existing methods for managing the missing data problem in machine learning. In
[21] a survey of outlier detection techniques reports seven different classes of outlier detection
techniques for a total of 47 different proposed techniques.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The Pucktrick Library</title>
      <p>To systematically introduce artificial errors into datasets and simulate real-world data
imperfections, we developed Pucktrick (https://andreamaurino.github.io/pucktrick-ui-docs/). The
library is named after Puck, the elf in “A Midsummer Night’s Dream” by William Shakespeare,
who is famous for enjoying causing trouble and playing tricks on mortals and other fairies alike.
This tool enables users to contaminate datasets with controlled levels of errors such as those
described in Section 1, ensuring a structured and reproducible approach to studying the effects
of data corruption.</p>
      <p>The Pucktrick library is designed to introduce errors at a specified percentage, offering two
operational modes:
• New mode: This mode introduces errors into a clean dataset, allowing users to inject a
predefined level of corruption.
• Extended mode: This mode further contaminates a dataset that has already been modified,
ensuring that additional noise is selectively introduced into previously untouched portions
while maintaining the desired overall error distribution.</p>
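      <p>The bookkeeping behind extended mode can be sketched as follows; this is our own simplified illustration using missing values as the error type, not the library’s actual code, and the function name and signature are assumptions:</p>
      <preformat>
```python
import random

def extend_missing(column, target_fraction, rng=None):
    # Raise the overall share of missing (None) values to target_fraction,
    # nulling only positions that are still intact so that previously
    # introduced errors are preserved.
    rng = rng or random.Random(0)
    out = list(column)
    already = sum(v is None for v in out)
    needed = int(round(target_fraction * len(out))) - already
    if not needed > 0:
        return out  # dataset already meets the requested error percentage
    intact = [i for i, v in enumerate(out) if v is not None]
    for i in rng.sample(intact, needed):
        out[i] = None
    return out
```
      </preformat>
      <p>Applied to a column that is already 20% missing with a target of 50%, the sketch nulls only enough intact cells to reach the overall target, never fewer and never by re-corrupting already missing entries.</p>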
      <p>The library consists of five distinct modules, each designed to introduce a specific category of
data errors. These errors can be applied either at the dataset level (affecting all records) or at
the feature level (targeting specific attributes). The supported data types include:
• Categorical data: Errors such as label flipping, misspellings, and misclassifications.
• Numerical data: Noise injection, incorrect scaling, and value swapping.
• Boolean data: Random inversions of true/false values.
• Date and time data: Temporal shifts, incorrect formatting, and missing timestamps.</p>
      <p>By allowing users to precisely control the level and type of errors introduced, Pucktrick serves
as a powerful tool for evaluating machine learning robustness, benchmarking data-cleaning
algorithms, and understanding the impact of real-world noise on analytical models. The library
provides a structured, repeatable, and scalable approach to data contamination, bridging the
gap between synthetic data generation and real-world dataset imperfections.</p>
      <p>In the following subsection, we introduce the Pucktrick modules. Readers seeking more
details can refer to the online documentation for further information.</p>
      <sec id="sec-3-1">
        <title>3.1. Duplicate data</title>
        <p>The duplicate module offers functionalities for duplicating rows within a dataset, available in
both ’new’ and ’extended’ modes. This can be done either randomly across the entire dataset or
in a targeted manner, where specific rows are duplicated based on a chosen attribute value. This
feature is particularly useful for handling imbalanced datasets, where oversampling a particular
class can improve the performance of multi-class classifiers. This module allows users to test
how well machine learning models handle redundant information and potential data leakage
scenarios.</p>
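        <p>A simplified sketch of both duplication strategies (a hypothetical helper of our own, not the module’s real signature) might look like:</p>
        <preformat>
```python
import random

def duplicate_rows(rows, fraction, key=None, value=None, rng=None):
    # Append duplicated rows: drawn uniformly from the whole dataset, or,
    # when key/value are given, only from rows where row[key] == value
    # (the targeted, attribute-based variant).
    rng = rng or random.Random(0)
    pool = [r for r in rows if key is None or r.get(key) == value]
    k = max(1, int(round(fraction * len(rows))))
    return rows + [dict(rng.choice(pool)) for _ in range(k)]
```
        </preformat>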
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Label misclassification</title>
        <p>The label module introduces errors in target variables, simulating real-world misclassification
errors that commonly occur in practical settings [22]. In binary classification tasks, the module
allows direct label swapping (e.g., converting 0 to 1 and vice versa). For multi-class classification,
labels are randomly reassigned to a different class within the dataset. This feature enables
researchers to study the impact of label noise on classification performance and evaluate the
effectiveness of error correction algorithms.</p>
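        <p>The flipping logic can be sketched as follows (our own illustration; the function name and signature are assumptions, not the module’s API):</p>
        <preformat>
```python
import random

def flip_labels(labels, fraction, rng=None):
    # Reassign a fraction of labels to a different class; for binary targets
    # this reduces to swapping 0 and 1.
    rng = rng or random.Random(0)
    classes = sorted(set(labels))
    out = list(labels)
    for i in rng.sample(range(len(out)), int(round(fraction * len(out)))):
        out[i] = rng.choice([c for c in classes if c != out[i]])
    return out
```
        </preformat>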
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Missing data</title>
        <p>The missing module includes methods for inserting "null" values in a specific column with a
predefined percentage. Missing values are a frequent challenge in real-world datasets, often
caused by sensor failures, human errors, or incomplete data collection. This module enables
controlled experimentation with various missing data scenarios, helping to assess the
performance of imputation techniques and machine learning models under diferent levels of data
incompleteness.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Noisy data</title>
        <p>The noise module is one of the most complex components of the library, allowing users to
introduce random distortions into dataset features. Dedicated methods are implemented for
each data type outlined in Section 3:
• Continuous and discrete features: Noise is generated from a normal distribution within
the range defined by the minimum and maximum values of the feature. This simulates
real-world measurement errors or rounding imprecisions.
• Categorical integer features: Original values are replaced with alternative values randomly
sampled from the existing set, following a normal distribution to ensure realistic variation.
• Categorical string features: The module generates synthetic categories by introducing
new, normally distributed string values. This simulates scenarios where new categories
emerge in real-world data (e.g., evolving customer preferences, new product categories).</p>
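        <p>For continuous features, the described behaviour can be approximated as follows; this is a sketch under our own assumptions (one standard deviation of the feature as the noise scale, clipping to the observed range), not the module’s actual code:</p>
        <preformat>
```python
import random
import statistics

def add_gaussian_noise(values, fraction, rng=None):
    # Perturb a fraction of values with zero-mean Gaussian noise and clip
    # the result to the feature's observed [min, max] range, simulating
    # measurement errors without leaving the feature's domain.
    rng = rng or random.Random(0)
    lo, hi = min(values), max(values)
    sigma = statistics.pstdev(values) or 1.0
    out = list(values)
    for i in rng.sample(range(len(out)), int(round(fraction * len(out)))):
        out[i] = min(hi, max(lo, out[i] + rng.gauss(0.0, sigma)))
    return out
```
        </preformat>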
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Outlier data</title>
        <p>The outlier module generates extreme data points to simulate anomalies and rare events in a
dataset. The approach differs based on the data type:
• Continuous features: The module employs the 3-sigma method, a statistical technique
for detecting and generating outliers under the assumption of a normal distribution. The
method calculates the mean (μ) and standard deviation (σ) of the feature and defines the
outlier boundaries as:</p>
        <p>Upper Boundary = μ + 3σ,</p>
        <p>Lower Boundary = μ − 3σ.
• Integer features: A similar 3-sigma methodology is applied to discrete numerical data.
• Categorical features: To introduce categorical outliers, the module simulates an unknown
category by adding values that do not exist in the dataset. These values can take the form
of:
– A new string category labeled "Puck was here", mimicking unexpected input values.
– An integer category not previously present in the dataset, representing unseen class
labels in classification tasks.</p>
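        <p>The 3-sigma generation step for continuous features can be sketched as follows (illustrative only; how far beyond the boundary the injected point lands is our own assumption):</p>
        <preformat>
```python
import random
import statistics

def inject_outliers(values, fraction, rng=None):
    # Replace a fraction of values with points just beyond the 3-sigma
    # boundaries mu - 3*sigma and mu + 3*sigma of the feature.
    rng = rng or random.Random(0)
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values) or 1.0
    out = list(values)
    for i in rng.sample(range(len(out)), int(round(fraction * len(out)))):
        sign = rng.choice([-1, 1])
        out[i] = mu + sign * (3.0 * sigma + rng.random() * sigma)
    return out
```
        </preformat>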
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Pucktrick evaluation</title>
      <sec id="sec-4-1">
        <title>4.1. Evaluation pipeline</title>
        <p>To assess the effectiveness of the Pucktrick library, we construct a dedicated pipeline that
highlights the advantages of introducing errors into synthetically generated data. Figure 1
provides an overview of the pipeline. The provided pipeline illustrates the process of generating,
contaminating, and utilizing synthetic data for machine learning training. It begins with an
original dataset, which is split into two parts: one used for generating synthetic data and
another reserved as a test set. The synthetic data generator applies one of the synthetic data
generation algorithms referenced in Section 2 to produce a synthetic dataset that retains the
statistical properties of the original data. Subsequently, this
synthetic dataset is passed through the Pucktrick library to introduce controlled noise and data
contamination as outlined in Section 3.</p>
        <p>To highlight the contribution of each type of error produced, the pipeline constructs a separate
contaminated dataset for each introduced error type (e.g., label misclassification, outliers, etc.).
The contaminated datasets, along with the synthetic dataset, are used to train one or more
machine learning models. Once training is completed, the resulting models are then employed to
predict the correct class using the test set of the original dataset, which serves as a benchmark to
compare the model’s performance under different training conditions. The obtained results are
subsequently compared with the available labels in the test set to compute standard performance
metrics, such as accuracy, F1-score, precision, recall, and AUC. Finally, the different performance
outcomes are compared to assess the effectiveness of the proposed approach. This approach
ensures a systematic evaluation of data contamination effects on model generalization, providing
insights into how well machine learning models handle real-world noisy and imperfect data.</p>
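        <p>For the binary case, the comparison step reduces to standard metric computations; a self-contained sketch of the calculation used to rank training conditions:</p>
        <preformat>
```python
def classification_metrics(y_true, y_pred):
    # Accuracy, precision, recall and F1-score for binary labels (1 = positive).
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```
        </preformat>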
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Dataset and experimental setup</title>
        <p>To evaluate the effectiveness of Pucktrick, we selected a diverse set of datasets related to stock
market activities spanning the years 2014 to 2018. These datasets were chosen due to their
highly dynamic nature, real-world complexity, and susceptibility to various types of data errors,
such as missing values, noise, outliers, and label misclassifications.</p>
        <p>The dataset collection consists of five different datasets, each containing over 200 financial
indicators commonly found in publicly traded company reports. These indicators encompass
key financial metrics such as revenue, profit margins, asset values, stock price fluctuations,
and trading volumes, offering a comprehensive view of financial market trends. Each dataset
includes information on more than 4,000 publicly traded US stocks per year, ensuring a broad
representation of market dynamics. However, these datasets are not free from imperfections.
Certain financial indicators contain missing values, which can result from incomplete reporting
or data collection issues. Additionally, outliers are present, representing extreme values that are
likely caused by data entry errors, mistypings, or unusual market behaviors. These underlying
issues make the datasets particularly suitable for testing Puck’s ability to simulate realistic data
contamination scenarios.</p>
        <p>The dataset also includes key financial performance indicators relevant to stock trading
decisions. Among these, the second-to-last column, PRICE VAR [%], represents the percentage
price variation of each stock over the course of a given year. This metric is crucial for evaluating
a stock’s annual performance. The last column, class, provides a binary classification for each
stock based on its price variation. If the PRICE VAR [%] value is positive, the stock is assigned
CLASS = 1, indicating that it has increased in value. From a trading perspective, this signifies
that an idealized trader would have benefited from buying the stock at the beginning of the year
and selling it at the end for a profit. Conversely, if the PRICE VAR [%] value is negative, the stock
is labeled CLASS = 0, meaning its value declined over the year. In this case, a rational trader
would avoid purchasing the stock to prevent capital loss. This dataset presents a highly realistic
financial scenario, where market fluctuations introduce inconsistencies and imperfections in
the data. Stock price variations depend on numerous external factors such as market sentiment,
economic conditions, corporate performance, and investor behavior, all of which contribute
to data irregularities. Moreover, the presence of missing values, outliers, and misclassified
stock performance further enhances the dataset’s suitability for testing Pucktrick’s ability to
introduce and manage controlled data contamination.</p>
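The labeling rule above is a simple sign test on the yearly price variation; a minimal pandas illustration with made-up values (the column names follow the dataset):

```python
import pandas as pd

df = pd.DataFrame({"PRICE VAR [%]": [12.5, -3.4, 0.8, -20.0]})
# CLASS = 1 if the stock gained value over the year, else 0.
df["class"] = (df["PRICE VAR [%]"] > 0).astype(int)
print(df["class"].tolist())  # → [1, 0, 1, 0]
```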
        <p>For this experiment, we considered only the top 20 features with the highest correlation
with the binary target variable. The correlation matrix of the reduced dataset is shown in Figure 2,
highlighting the relationships between financial indicators and stock performance classification.
Additionally, the dataset contains a significant number of missing values, as shown in Figure
3, which poses challenges for machine learning models and serves as a natural test case for
assessing the impact of synthetic data contamination.</p>
        <p>
          To generate the synthetic dataset, we used the GreaT algorithm [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] through the publicly
available be_great Python library8. We employed the distilgpt2 large language model with
a batch size of 32 for 50 epochs, ensuring a robust generative process capable of mimicking
real-world financial data patterns.
        </p>
        <p>To evaluate the efects of controlled data contamination, we created 61 synthetic datasets,
categorized as follows:</p>
        <p>• One dataset with only mislabeled errors (error rate fixed at 30%).
• One dataset for each attribute containing only missing data (error rate fixed at 30%).
• One dataset for each attribute containing only noisy data (error rate fixed at 30%).
• One dataset for each attribute containing only outlier data (using the 3σ model with an
error rate fixed at 30%).</p>
        <p>8: https://github.com/kathrinse/be_great</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Machine learning models and their Python implementations.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model name</th><th>Python reference</th></tr>
            </thead>
            <tbody>
              <tr><td>SVM - Linear Kernel</td><td>sklearn.linear_model._stochastic_gradient.SGDC. . .</td></tr>
              <tr><td>Extra Trees Classifier</td><td>sklearn.ensemble._forest.ExtraTreesClassifier</td></tr>
              <tr><td>Random Forest Classifier</td><td>sklearn.ensemble._forest.RandomForestClassifier</td></tr>
              <tr><td>K Neighbors Classifier</td><td>sklearn.neighbors._classification.KNeighbors. . .</td></tr>
              <tr><td>Linear Discriminant Analysis</td><td>sklearn.discriminant_analysis.LinearDiscriminant. . .</td></tr>
              <tr><td>Multilayer Perceptron Classifier</td><td>sklearn.neural_network._multilayer_perceptron.. . .</td></tr>
              <tr><td>Logistic Regression</td><td>sklearn.linear_model._logistic.LogisticRegression</td></tr>
              <tr><td>Naive Bayes</td><td>sklearn.naive_bayes.GaussianNB</td></tr>
              <tr><td>Decision Tree Classifier</td><td>sklearn.tree._classes.DecisionTreeClassifier</td></tr>
              <tr><td>Quadratic Discriminant Analysis</td><td>sklearn.discriminant_analysis.QuadraticDiscri. . .</td></tr>
            </tbody>
          </table>
        </table-wrap>
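The 3σ model referenced in the outlier bullet above can be illustrated with numpy; `inject_outliers` is a hypothetical helper, not Pucktrick's actual code, that pushes a fixed fraction of entries past three standard deviations from the column mean:

```python
import numpy as np

def inject_outliers(values, rate, seed=0):
    """Replace `rate` of the entries with values beyond mean ± 3*std
    (hypothetical stand-in for the 3-sigma outlier contamination)."""
    rng = np.random.default_rng(seed)
    out = values.copy()
    mu, sigma = out.mean(), out.std()
    idx = rng.choice(out.size, size=round(rate * out.size), replace=False)
    signs = rng.choice([-1.0, 1.0], size=idx.size)
    # Place each injected outlier just past the 3-sigma boundary.
    out[idx] = mu + signs * (3.5 * sigma)
    return out

x = np.random.default_rng(1).normal(loc=10.0, scale=2.0, size=1000)
contaminated = inject_outliers(x, rate=0.30)
mask = np.abs(contaminated - x.mean()) > 3 * x.std()
print(mask.mean())  # ≈ 0.3
```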
        <p>To ensure a comprehensive evaluation, we utilized 10 machine learning models from diverse
algorithmic categories, including tree-based models, linear classifiers, probabilistic models,
neural networks, and nearest neighbors approaches. The specific models and their corresponding
Python implementations are detailed in Table 1. To streamline the model selection, training,
and evaluation process, we leveraged PyCaret9, an open-source, low-code Python library
that automates machine learning workflows. PyCaret simplifies the experimental pipeline
by handling data preprocessing, hyperparameter tuning, and model validation, allowing for
efficient large-scale model comparison.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results</title>
        <p>By comparing the performance of the 610 trained models (61 contaminated datasets for 10
algorithms), we obtained several key insights into the impact of synthetic data contamination on
machine learning accuracy, showing that the use of the Pucktrick library significantly enhances
the performance of machine learning algorithms compared to using only the synthetic dataset.</p>
        <p>First, we address the question: Which machine learning algorithm benefits the most from the
use of the Pucktrick library?</p>
        <p>To answer this question, we analyzed the percentage of models trained on contaminated
datasets that achieved a higher accuracy than their counterparts trained on purely synthetic
data. The results, summarized in Table 2, reveal key insights into how different machine
learning models respond to various types of data contamination. For noisy data, the Support
Vector Machine (SVM) model consistently showed the highest improvement when trained on
noise-contaminated datasets across all three years (2014, 2015, and 2016). This suggests that
SVM models benefit from controlled noise introduction, possibly due to their inherent ability to
find optimal decision boundaries even in the presence of variations in the feature space. Noise
may serve as a form of regularization, preventing overfitting and improving model robustness.
Regarding missing values, the Extra Trees (ET) classifier exhibited the most significant improvement
when trained on datasets containing missing values. Across all years, ET was the best-performing
model for handling missing data, indicating that ensemble tree-based methods are particularly
resilient to incomplete datasets. This aligns with previous findings in the literature, as decision
trees and ensemble models can handle missing data by splitting on available features without
requiring imputation techniques. As with missing values, ET consistently outperformed other
models when trained on datasets containing outliers. Interestingly, in 2015, a linear model also
showed competitive performance, suggesting that some models can effectively learn from
extreme values when trained on outlier-containing data. This further confirms the robustness
of tree-based models in handling anomalies, as they can partition the feature space in a way that
minimizes the impact of extreme values. Unlike the other types of contamination, no single
model stood out as the best performer in handling mislabeled data. The results indicate that
different models responded differently to label errors, with Linear Discriminant Analysis (LDA)
performing well on the 2014 dataset, while Random Forest (RF) and Extra Trees (ET) emerged as
the strongest models for the 2015 and 2016 datasets, respectively. This suggests that the impact
of label misclassification is highly model-dependent, and additional correction mechanisms
such as semi-supervised learning or active learning strategies may be needed to mitigate its
effects effectively.</p>
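The noise contamination discussed above can be illustrated with a Gaussian perturbation of a fixed fraction of entries (an illustrative sketch; Pucktrick's exact noise mechanism is the one described in Section 3):

```python
import numpy as np

def inject_noise(values, rate, scale=0.5, seed=0):
    """Add zero-mean Gaussian noise to `rate` of the entries.
    `scale` is expressed relative to the column's standard deviation
    (hypothetical stand-in for the noisy-data contamination)."""
    rng = np.random.default_rng(seed)
    out = values.copy()
    idx = rng.choice(out.size, size=round(rate * out.size), replace=False)
    out[idx] += rng.normal(scale=scale * out.std(), size=idx.size)
    return out

x = np.linspace(0.0, 1.0, 200)
noisy = inject_noise(x, rate=0.30)
changed = np.flatnonzero(noisy != x)
print(changed.size)  # → 60
```

Because only 30% of the entries are perturbed, the remaining values stay bitwise identical, which keeps the clean and contaminated variants directly comparable.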
        <p>Table 2 (rows correspond to the error types: Noisy, Missing, Outlier, Labels).</p>
        <p>A second question we address is: Which type of error, when introduced into the synthetic
dataset, improves or maintains accuracy across all classifiers and throughout all three years of
evaluation? To answer this question, we calculated the percentage of models, trained on a
contaminated dataset, whose accuracy is higher (best) or higher or equal (best or equal) to that of
the same classifier trained on the synthetic dataset. The results are summarized in Table
3. Across all years, models trained on noisy data consistently outperformed or matched those
trained on purely synthetic data, with 100% of models for the 2014 dataset maintaining
accuracy and 90% improving. Although the effect weakened slightly for 2015 and 2016, 70% of
models still maintained or improved performance, confirming that controlled noise acts as a
form of regularization, preventing overfitting and enhancing generalization. Missing values
also had a positive effect, with 82% of models in 2014 performing at least as well as those
trained on synthetic data, though only 44% showed direct improvements. Similar trends were
observed in 2015 and 2016, where 75% maintained accuracy, but only 27–35% improved. This
suggests that while some models can adapt to missing data, others require feature engineering
or imputation techniques to perform optimally. The impact of outliers was moderate, with
70–79% of models across all years maintaining or improving accuracy, while 27–39% experienced
direct gains. Though not as influential as noise, outliers were better handled by tree-based
models, reinforcing their resilience to high-variance data. Label misclassification had the most
unpredictable effect. While 90% of models maintained or improved accuracy, only 10–20% saw
actual performance gains, indicating that some models can tolerate label noise, while others are
significantly affected. This inconsistency suggests that additional correction mechanisms, such
as semi-supervised learning or label adjustment strategies, may be necessary to mitigate its
impact.</p>
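The "best" and "best or equal" percentages reported in Table 3 follow from a per-classifier comparison; a small sketch with made-up accuracies for ten models:

```python
import numpy as np

# Hypothetical accuracies for the same 10 classifiers trained on the
# synthetic baseline vs. one contaminated variant (illustrative numbers).
baseline     = np.array([0.70, 0.72, 0.65, 0.80, 0.75, 0.68, 0.71, 0.66, 0.74, 0.69])
contaminated = np.array([0.73, 0.72, 0.67, 0.78, 0.77, 0.70, 0.71, 0.69, 0.76, 0.72])

n_better = int(np.sum(contaminated > baseline))      # strictly improved
n_not_worse = int(np.sum(contaminated >= baseline))  # improved or maintained
best = 100.0 * n_better / baseline.size
best_or_equal = 100.0 * n_not_worse / baseline.size
print(best, best_or_equal)  # → 70.0 90.0
```

Here 7 of 10 models improve and 2 more tie, so "best" is 70% and "best or equal" is 90%, mirroring how each cell of Table 3 is computed.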
        <p>Table 3 (rows correspond to the error types: Labels, Outlier, Noisy, Missing).</p>
        <p>As shown in Table 4, Pucktrick enhances the performance of Extra Trees and Random Forests
in over 90% of experiments, demonstrating its effectiveness in improving tree-based models.
Furthermore, Pucktrick consistently improves or maintains the accuracy of models trained on
the original synthetic dataset, ensuring that data contamination does not degrade performance.
In general, the results confirm that Pucktrick positively impacts or preserves accuracy across
all major machine learning algorithm categories, including tree-based, linear, neural, and
neighbor-based models.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>Training machine learning models with real-world data is often difficult due to privacy and
confidentiality restrictions, limiting the ability to develop classifiers using actual datasets. To
address this, various synthetic data generation algorithms have been developed over the years,
aiming to replicate the statistical properties of real data. However, a critical limitation of
synthetic datasets is their lack of real-world imperfections. Unlike actual datasets, synthetic
data is often too clean, as it does not contain common data issues such as missing values, outliers,
and label misclassification, which are inherent in real-world applications. In this paper, we
introduced Pucktrick, a library designed to systematically introduce a controlled percentage
of errors into datasets, either at the feature level or across the entire dataset. Experimental
evaluations on three real-world datasets confirm the effectiveness of this approach. Notably,
linear models such as SVMs and tree-based models demonstrate improved accuracy when
trained on controlled-error datasets, suggesting that strategic contamination can enhance model
robustness.</p>
      <p>There are several directions for future research. One key focus is the development of an
enhanced version of the Pucktrick library that supports a more generalized strategy for error
insertion, allowing for greater flexibility in simulating real-world data imperfections. Additionally,
we are exploring methods to modify dataset schemas rather than only altering data values. A
common challenge arises when the schema of a dataset used by a machine learning model differs
from that of the training dataset. For instance, a feature such as total living area in the training
data may later be divided into interior area and exterior area in a real-world application, leading
to a structural mismatch that could affect model performance. Addressing such discrepancies
will be crucial for ensuring model adaptability across different data environments.</p>
      <p>Furthermore, additional experiments are needed to gain deeper insights into the relationship
between error types, model architectures, and feature selection. A key objective is to help users
identify the minimum set of essential features required to improve classifier accuracy, even when
working with imperfect datasets. By refining these aspects, we aim to make Pucktrick a more
comprehensive tool for evaluating machine learning models under realistic data conditions.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
      <p>[18] Q. Liu, M. Khalil, J. Jovanovic, R. Shakya, Scaling while privacy preserving: A comprehensive synthetic tabular data generation and evaluation in learning analytics, in: Proceedings of the 14th Learning Analytics and Knowledge Conference, 2024, pp. 620-631.
[19] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, A. Tewari, Learning with noisy labels, in: C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 26, Curran Associates, Inc., 2013. URL: https://proceedings.neurips.cc/paper_files/paper/2013/file/3871bd64012152bfb53fdf04b401193f-Paper.pdf.
[20] T. Emmanuel, T. Maupong, D. Mpoeleng, T. Semong, E. Tabane, M. Velempini, A survey on missing data in machine learning, Journal of Big Data 8 (2021) 140. URL: https://doi.org/10.1186/s40537-021-00516-9. doi:10.1186/s40537-021-00516-9.
[21] A. Boukerche, L. Zheng, O. Alfandi, Outlier detection: Methods, models, and classification, ACM Computing Surveys (CSUR) 53 (2020) 55:1-55:37. URL: https://doi.org/10.1145/3381028. doi:10.1145/3381028.
[22] B. Frénay, A. Kabán, et al., A comprehensive introduction to label noise, in: ESANN, Citeseer, 2014.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paleyes</surname>
          </string-name>
          , R.-G. Urma,
          <string-name>
            <given-names>N. D.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <article-title>Challenges in deploying machine learning: a survey of case studies</article-title>
          ,
          <source>ACM computing surveys 55</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Figueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vaz</surname>
          </string-name>
          ,
          <article-title>Survey on synthetic data generation, evaluation methods and gans</article-title>
          ,
          <source>Mathematics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>2733</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>K. S</surname>
          </string-name>
          , M. Durgadevi,
          <article-title>Generative adversarial network (gan): a general review on diferent variants of gan and applications</article-title>
          ,
          <source>in: 2021 6th International Conference on Communication and Electronics Systems (ICCES)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . doi:10.1109/ICCES51350.2021.9489160.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Garcia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Sayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Peterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mahmood</surname>
          </string-name>
          ,
          <article-title>Variations in variational autoencoders - a comparative evaluation</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>153651</fpage>
          -
          <lpage>153670</lpage>
          . doi:10.1109/ACCESS.2020.3018151.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Panagiotakos</surname>
          </string-name>
          ,
          <article-title>Real-world data: a brief review of the methods, applications, challenges and opportunities</article-title>
          ,
          <source>BMC Medical Research Methodology</source>
          <volume>22</volume>
          (
          <year>2022</year>
          )
          <fpage>287</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Iskander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Karnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Shapira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tolmach</surname>
          </string-name>
          ,
          <article-title>Quality matters: Evaluating synthetic data for tool-using llms</article-title>
          ,
          <source>arXiv preprint arXiv:2409.16341</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          , W. Chen,
          <article-title>Unveiling the flaws: Exploring imperfections in synthetic data and mitigation strategies for large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2406.12397</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Trapp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stenger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Leppich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kounev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leznik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chard</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Foster</surname>
          </string-name>
          ,
          <article-title>Comprehensive exploration of synthetic data generation: A survey</article-title>
          ,
          <source>arXiv preprint arXiv:2401.02524</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fonseca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bacao</surname>
          </string-name>
          ,
          <article-title>Tabular and latent space synthetic data generation: a literature review</article-title>
          ,
          <source>Journal of Big Data</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>115</fpage>
          . URL: https://doi.org/10.1186/s40537-023-00792-7. doi:10.1186/s40537-023-00792-7.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Skoularidou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cuesta-Infante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Veeramachaneni</surname>
          </string-name>
          ,
          <article-title>Modeling tabular data using conditional gan</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS)</source>
          , volume
          <volume>32</volume>
          ,
          <year>2019</year>
          , p.
          <fpage>1049</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Borisov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Seßler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Leemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pawelczyk</surname>
          </string-name>
          , G. Kasneci,
          <article-title>Language models are realistic tabular data generators</article-title>
          ,
          <source>in: The Eleventh International Conference on Learning Representations, ICLR</source>
          <year>2023</year>
          , Kigali, Rwanda, May 1-
          <issue>5</issue>
          ,
          <year>2023</year>
          , OpenReview.net,
          <year>2023</year>
          . URL: https://openreview.net/forum?id=cEygmQNOeI.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Patki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wedge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Veeramachaneni</surname>
          </string-name>
          ,
          <article-title>The synthetic data vault</article-title>
          ,
          <source>in: 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)</source>
          , IEEE,
          <year>2016</year>
          , pp.
          <fpage>399</fpage>
          -
          <lpage>410</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Nasios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bors</surname>
          </string-name>
          ,
          <article-title>Variational learning for gaussian mixture models</article-title>
          ,
          <source>IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)</source>
          <volume>36</volume>
          (
          <year>2006</year>
          )
          <fpage>849</fpage>
          -
          <lpage>862</lpage>
          . doi:10.1109/TSMCB.2006.872273.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Walia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tierney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mckeever</surname>
          </string-name>
          ,
          <article-title>Synthesising tabular data using wasserstein conditional gans with gradient penalty (wcgan-gp)</article-title>
          ,
          <source>in: Irish Conference on Artificial Intelligence and Cognitive Science</source>
          ,
          <year>2020</year>
          . URL: https://api.semanticscholar.org/CorpusID:229345165.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Milne</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. I. Nachman</surname>
          </string-name>
          ,
          <article-title>Wasserstein gans with gradient penalty compute congested transport</article-title>
          , in: P.
          <string-name>
            <surname>-L. Loh</surname>
          </string-name>
          , M. Raginsky (Eds.),
          <source>Proceedings of Thirty Fifth Conference on Learning Theory</source>
          , volume
          <volume>178</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>129</lpage>
          . URL: https://proceedings.mlr.press/v178/milne22a.html.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Seedat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van der Schaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Petrovic</surname>
          </string-name>
          ,
          <article-title>Reimagining synthetic tabular data generation through data-centric ai: A comprehensive benchmark</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>33781</fpage>
          -
          <lpage>33823</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Rashidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Albahra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>A novel and fully automated platform for synthetic tabular data generation and validation</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>14</volume>
          (
          <year>2024</year>
          )
          <fpage>23312</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>