<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dividi et Impera: Enhancing Synthetic Data Fidelity through Data Partitioning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yordanos Nebiyou Yifru</string-name>
          <email>yordanos.yifru@studenti.unime.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Salvatore Distefano</string-name>
          <email>salvatore.distefano@unime.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Messina</institution>
          ,
          <addr-line>Messina, Italy</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Messina</institution>
          ,
          <addr-line>Messina, Italy</addr-line>
        </aff>
      </contrib-group>
      <kwd-group>
        <kwd>Synthetic Data</kwd>
        <kwd>Fidelity</kwd>
        <kwd>Gaussian Copulas</kwd>
        <kwd>Generative Models</kwd>
        <kwd>Data Quality</kwd>
        <kwd>Data Balancing</kwd>
        <kwd>Undersampling</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Privacy</kwd>
        <kwd>Data Utility</kwd>
        <kwd>Data</kwd>
      </kwd-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>9</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>Data-intensive application utility is often hampered by data scarcity, quality, heterogeneity, and privacy constraints. This paper introduces a novel, modular framework for synthetic data generation, centered on a preliminary data partitioning stage. The proposed approach decomposes complex datasets into homogeneous subsets, enabling local generative models to capture intricate data characteristics. Our instantiation, Partitioned Gaussian Copula (PGC), combines Hierarchical Variance-Entropy Based Partitioning (HVEP) with Gaussian copulas. Experiments on geometric and real-world tabular datasets demonstrate the effectiveness of PGC in terms of machine learning utility and statistical fidelity compared to global models and baselines (CTGAN, TVAE). While PGC shows a slight trade-off in privacy metrics due to localized modeling, its enhanced data representation for complex distributions underscores partitioning as a critical architectural improvement in synthetic data generation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Synthetic data broadly encompasses partially synthetic data (where sensitive values are replaced),
hybrid synthetic data (combining real and synthetic components), and fully synthetic data (entirely
artificial datasets preserving statistical properties). Its generation leverages diverse methodologies,
including advanced Generative AI Models like GANs, VAEs, and Transformers, alongside statistical
modeling and rule-based systems.</p>
      <p>The core contribution of this paper lies in introducing a preliminary data partitioning stage within
the synthetic data generation process. We propose a novel, general framework that segments datasets
into more homogeneous subsets based on statistical or semantic criteria. This methodological shift
is central to our approach; rather than training a single, monolithic generative model on an entire,
potentially complex dataset, our framework enables the fitting of local generative models to each
individual partition. This preliminary partitioning significantly enhances the capacity of existing
generative techniques to capture intricate, nonlinear, and localized data dependencies more accurately.</p>
      <p>CEUR Workshop Proceedings (ISSN 1613-0073)</p>
      <p>As a concrete instantiation, we apply this framework to Gaussian copulas. This demonstrates a
remarkable improvement in their ability to model complex relationships that typically exceed their
expressive power, leading to enhanced performance even on challenging data structures. Our approach
is characterized by a simple yet effective hierarchical, axis-aligned binary splitting algorithm, guided by
variance reduction, which maintains interpretability and computational efficiency. Empirical evaluation
across various tabular datasets confirms that this partition-based generation framework yields improved
synthetic data utility and competitive privacy guarantees with minimal additional computational
overhead, making it a highly effective solution to prevalent data challenges.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Synthetic Data</title>
      <p>
        Although synthetic data has attracted significant attention in recent years, there is no universally accepted
definition in the literature. To cover the diverse range of applications and generation methods, we
adopt the following definition inspired by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Definition 1. Synthetic data is data that has been generated using a purpose-built mathematical model or
algorithm, with the aim of solving a (set of) data science task(s).</p>
      <p>
        Synthetic data is generated by a model, often with the purpose of using it instead of real data.[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] If
used responsibly, synthetic data promises to enable learning across datasets when the privacy of the
data needs to be preserved; or when data is incomplete, scarce or biased.[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
      </p>
      <p>One of the most prevalent modalities for synthetic data is image data. Synthetic image data is defined as
any image data that is either artificially created by modifying real image data or captured from synthetic
environments [4]. The need for robust image datasets for algorithm development and testing has
prompted the consideration of synthetic imagery as a supplement to real imagery [5]. Image recognition
is one of the most promising applications of synthetic image data. Large-scale annotated data has
revolutionized the field of image recognition. However, it is costly and time-consuming to manually
collect a large-scale labeled dataset, and recent concerns about data privacy and usage rights further
hinder this process [6]. The key advantage of synthesising image data, and the primary reason that
makes the generation of data faster and cheaper, is that a properly set up image synthesis pipeline is
capable of automating the data generation and labelling process at a comparatively low cost relative to
manual labour [4].</p>
      <p>There are many means to generate synthetic image data, with different methodologies suited to
different tasks and applications [4], such as manual generation, Generative Adversarial Networks (GANs),
Variational Autoencoders (VAEs), hybrid networks, 3D morphable models, parametric models, and so
on. VAEs [7] consist of two neural networks and represent a step forward from classical autoencoders.
VAE networks have been used in image denoising [8] and image compression [9], but have seen limited
use for image generation on their own due to their blurry output. Research on the use of VAEs for image
generation has transitioned to hybrid models that use both VAEs and GANs [4]. The GAN model [10]
uses two networks, a generator and a discriminator, that compete adversarially, with the generator
aiming to fool the discriminator by producing realistic synthetic data. The more well-known GAN models
used to generate image data are BigGAN, CycleGAN, DALL-E 2 (and similar models), DCGAN, and
StyleGAN [4]. Overall, GANs have multiple notable use cases in synthetic data generation. The most
common use case is to generate data for training other networks [4].</p>
      <p>
        Another modality for synthetic data is text data. Text generation is the task of automatically
generating texts which maintain specific properties of real texts [11]. Machine-learning-powered text
classification models have been widely applied in diverse applications such as detecting biased or toxic
language on online platforms and filtering spam emails [12]. The performance of these models is directly
related to the volume and quality of the data we have. This poses a huge challenge, as the training
data collection and curation process is often costly, time-consuming, and complex [12]. In addition,
the issue of privacy has gained increasing attention in natural language processing (NLP) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This
makes synthetic text generation an increasingly valuable approach, as it allows for the creation of large,
diverse, and privacy-preserving datasets that can be used to train and evaluate NLP models effectively.
      </p>
      <p>In the past, models such as Latent Dirichlet Allocation (LDA), Markov chains (MC), and hidden Markov
models (HMM) have been used for generating synthetic text data. However, with the recent advancements
in large language models (LLMs), researchers have started to explore the potential of utilizing LLMs for
generating synthetic data tailored to specific tasks and augmenting the training data in low-resource
data settings [12].</p>
      <p>The final and less explored modality for synthetic data is tabular data. Synthetic tabular data
denotes data that is synthetically produced to replicate the structure and statistical characteristics of
real-world tabular data, which is commonly arranged in rows and columns. In this format, columns
correspond to features or attributes, such as categorical, numerical, or ordinal variables, while rows
represent individual data entries or observations.</p>
      <p>When working with scarce or imbalanced datasets, generative modeling can be used to augment the
data by creating synthetic data points that fill in the gaps. This can help to improve the performance of
machine learning models, as they will have more data to train on [13].</p>
      <p>Several models have been used for the generation of synthetic tabular data. Traditional approaches
to the generation of synthetic tabular data include Bayesian networks [14, 15] and copulas [16, 17].
Modern methods for modeling tabular data are based on deep learning. The Conditional Tabular
Generative Adversarial Network (CTGAN) [18] is a deep learning method for modeling tabular data that uses
a conditional GAN to capture complex non-linear relationships [19]. [20] introduced the most optimized
version of the conditional GAN for tabular data. The Tabular Variational Autoencoder (TVAE) [18] is a type
of Variational Autoencoder for modeling tabular data.</p>
      <p>
        Synthetic data are being used as a solution to a variety of problems in many domains[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] such as data
exchange while preserving privacy, data debiasing for fairness, and data augmentation.
      </p>
      <p>
        The wide adoption of data-driven machine learning solutions as the prevailing approach to innovate
has created a need to share data[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. When privacy preservation is one of the goals of data sharing,
synthetic data offers a potential solution.
      </p>
      <p>Data-driven algorithms are only as good as the data they work with, while datasets, especially social
data, often fail to represent minorities adequately [21]. These algorithms may under-perform if trained
on pre-existing biases which lie inside data distributions. [22] showed that data augmentation can
reduce classification error for discriminated groups, and [23] further demonstrated the large potential of
synthetic data for analyzing and reducing the negative effects of dataset bias on deep face recognition
systems.</p>
      <p>Data augmentation using synthetic data has emerged as a promising strategy to address challenges
related to data scarcity, bias, and fairness in machine learning. By generating realistic yet artificial
samples, synthetic data helps expand limited datasets and improves model generalization, especially for
underrepresented groups.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>This paper proposes a modular framework for synthetic data generation that combines data partitioning
and synthetic data generation models within these partitions to address the challenges of heterogeneity
and complexity in datasets. The main idea is to decompose the input dataset into smaller subsets and
train dedicated generative models on them, thereby enabling the modeling process to adapt to local
data characteristics. The synthetic data generation workflow, shown in Figure 1, relies fundamentally
on a few stages: i) exploratory data analysis and preprocessing, ii) partitioning (algorithm selection
and application) and iii) synthetic data generation (model selection and application on the partitions).
Once generated, the synthetic data quality is assessed, and if it does not meet the quality criteria, the
process restarts from partitioning.</p>
      <sec id="sec-3-1">
        <title>3.1. Exploratory Data Analysis</title>
        <p>Exploratory Data Analysis (EDA) is a critical initial step that involves understanding the structure,
distribution, and quality of the data set before synthetic data generation. By examining data types,
identifying patterns or anomalies, and assessing diversity and complexity, EDA guides the selection of
partitioning strategies and suitable generative models. Skipping this step risks applying inappropriate
models that may generate unrealistic or biased synthetic data.</p>
        <p>(Figure 1: Synthetic data generation workflow. Exploratory data analysis and preprocessing;
partition algorithm selection and partitioning of the dataset into partitions 1..n; selection and training
of a synthetic data generation model per partition; sample generation from each model; combination of
the samples into the final dataset; quality assessment check, returning the synthetic dataset on success
or restarting from partitioning otherwise.)</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Preprocessing</title>
        <p>Data preprocessing prepares the data set for partitioning and modeling by addressing quality and
consistency issues. This includes handling missing values, encoding categorical features, detecting and
removing outliers, and balancing imbalanced datasets. These steps reduce noise while ensuring that
the input data align with the model requirements, ultimately enhancing the fidelity and usefulness of
the generated synthetic data.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Partition algorithm selection</title>
        <p>This step involves selecting an appropriate partitioning algorithm.</p>
        <p>A partitioning algorithm can be defined as a function P : D → {D_1, D_2, …, D_k}
mapping the input dataset D to a set of subsets {D_1, D_2, …, D_k}, where D_1 ∪ D_2 ∪ … ∪ D_k ⊆ D and the number
of partitions k can be specified by the user or adaptively determined by the partitioning algorithm
based on the data characteristics. Partitioning the dataset D into the subsets {D_1, D_2, …, D_k} aims to isolate
regions of the data that exhibit relatively homogeneous characteristics. This simplifies modeling and
improves synthetic data quality by allowing models to capture local patterns more effectively.</p>
        <p>The choice of a partitioning strategy is highly dependent on the intrinsic properties of the dataset  .
Key characteristics influencing this decision include:

• Data type and modality: Whether the data is tabular, image-based, text, time-series, or
multimodal can significantly affect the suitable partitioning methods.
• Statistical properties: Measures such as variance, entropy, correlation between features, or
class imbalance provide insight into the complexity and heterogeneity of the data.
• Structural characteristics: The presence of clusters, hierarchical groupings, or other latent
structures in the data guides the partitioning approach.
• Dimensionality: High-dimensional data may require dimensionality reduction or specialized
partitioning algorithms to handle sparsity and noise.</p>
        <p>Effective partitioning is central to improving the quality of synthetic data generation, as it
enables localized modeling within more homogeneous data subsets. For tabular data, partitions can be
created using categorical splits, decision trees, or correlation-based clustering in rank space. These
approaches isolate subpopulations with simpler statistical properties, improving model accuracy.
Advanced techniques like PCA-based variance reduction and latent space clustering (e.g., via VAEs) further
capture complex dependencies by transforming the data into representations where partitions are more
meaningful.</p>
        <p>For image datasets, partitioning leverages the rich semantic information captured by pretrained CNNs
such as ResNet to generate compact embeddings. Clustering in this feature space forms semantically
coherent groups suitable for local generation. Hierarchical partitioning or label-based segmentation
also helps preserve semantic granularity—e.g., dividing images first by broad classes and then refining
into subcategories. When performed in learned latent spaces, clustering can uncover intricate visual
structures not easily detected through pixel-level analysis.</p>
        <p>Text partitioning utilizes contextual embeddings from transformer models like BERT to group text
by semantic similarity. This enables the discovery of topic-based clusters for more focused modeling.
Additionally, linguistic features such as syntax or sentence complexity can inform partitions along
stylistic or structural lines. Together, semantic and linguistic partitioning strategies produce coherent
textual subsets, enhancing the performance and interpretability of generative models tailored to diverse
language use cases.</p>
        <p>Although the proposed partition-based framework can be applied across diverse data modalities,
the choice of partition strategy must be tailored to the structure and semantics of the specific data
type. To this end, we introduce a custom Hierarchical Variance-Entropy based Partitioning (HVEP)
algorithm, specifically designed for tabular data sets. The HVEP algorithm recursively partitions the
data into smaller and more homogeneous subsets using axis-aligned binary splits driven by maximizing
an aggregated gain metric across all features (e.g., variance reduction for continuous variables or
information gain for categorical ones). It builds a binary tree where each internal node corresponds to a
split on one feature, and each leaf node represents a final partition that is simple enough to be modeled
independently.</p>
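        <p>The HVEP procedure described above can be sketched compactly. The following is a minimal, numeric-only illustration of the idea, not the paper's implementation: the median candidate split point, the <code>min_size</code> and <code>min_gain</code> stopping rules, and the omission of the information-gain term for categorical features are all simplifying assumptions of this sketch.</p>

```python
import numpy as np

def variance_gain(col, threshold):
    """Variance reduction achieved by an axis-aligned binary split of one column."""
    left, right = col[col <= threshold], col[col > threshold]
    if len(left) < 2 or len(right) < 2:
        return 0.0
    n = len(col)
    return col.var() - (len(left) / n) * left.var() - (len(right) / n) * right.var()

def hvep(X, min_size=50, min_gain=1e-3):
    """Recursively split X with axis-aligned binary cuts, choosing the feature
    with the largest variance gain; the leaves become the final partitions.
    Numeric-only sketch: entropy gain for categorical features is omitted."""
    if len(X) < 2 * min_size:
        return [X]
    best_gain, best_j, best_thr = 0.0, None, None
    for j in range(X.shape[1]):
        thr = np.median(X[:, j])  # simple candidate split point per feature
        g = variance_gain(X[:, j], thr)
        if g > best_gain:
            best_gain, best_j, best_thr = g, j, thr
    if best_j is None or best_gain < min_gain:
        return [X]  # no worthwhile split: this subset is a final partition
    left = X[X[:, best_j] <= best_thr]
    right = X[X[:, best_j] > best_thr]
    if len(left) < min_size or len(right) < min_size:
        return [X]
    return hvep(left, min_size, min_gain) + hvep(right, min_size, min_gain)
```

        <p>Each returned leaf corresponds to one partition D_i on which a local generative model is then trained independently.</p>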
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Synthetic data generation</title>
        <p>This process includes the main steps for synthetic data generation, and it is driven by a set of generative
functions or models ℳ_i trained independently on each subset D_i, which then generate synthetic samples
D̃_i as shown in Figure 1. The ℳ_i models can be any probabilistic or neural generative model (e.g.,
Gaussian copula, VAE, GAN), depending on the application requirements. The choice of generative
model in our framework is directly informed by the nature of the partitions created during the data
partition phase. Different partitioning strategies yield subsets with varied statistical properties, such
as lower variance, more homogeneous correlations, or class-specific features, that enable even simple
models to effectively learn from localized data. This piecewise modeling approach leverages the fact
that many generative models, particularly those that assume linearity or unimodality, perform poorly
on globally complex data but succeed when applied to well-chosen subregions.</p>
        <p>For example, Gaussian Copulas assume linear dependencies after Gaussian transformation of
marginals and therefore work best on partitions where such linearity holds. Effective partitioning
techniques include variance reduction (e.g., PCA + k-means), conditioning on discrete variables (like
gender or product category), and tree-based splits that isolate local linear relationships. By aligning the
partitions with the copula modeling assumptions, we enable it to capture complex global structures
through a composition of locally linear models.</p>
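        <p>To make the per-partition workflow concrete, here is a compact sketch of fitting a Gaussian copula to each partition and sampling from it. It assumes continuous features and uses empirical marginals with rank-based normal scores, which is one common way to implement Gaussian copula synthesis; it is an illustration of the approach, not necessarily the exact implementation used in the experiments.</p>

```python
import numpy as np
from scipy import stats

def fit_gaussian_copula(X):
    """Fit a Gaussian copula: keep the sorted empirical marginals and the
    correlation matrix of the normal scores derived from the ranks."""
    n, _ = X.shape
    u = (np.argsort(np.argsort(X, axis=0), axis=0) + 0.5) / n  # empirical CDF in (0, 1)
    z = stats.norm.ppf(u)                                      # normal scores
    corr = np.corrcoef(z, rowvar=False)
    marginals = np.sort(X, axis=0)
    return corr, marginals

def sample_gaussian_copula(corr, marginals, m, rng):
    """Sample correlated normals, map to (0, 1), invert empirical marginals."""
    d = corr.shape[0]
    z = rng.multivariate_normal(np.zeros(d), corr, size=m)
    u = stats.norm.cdf(z)
    n = marginals.shape[0]
    idx = np.clip((u * n).astype(int), 0, n - 1)  # quantile lookup per column
    return np.column_stack([marginals[idx[:, j], j] for j in range(d)])

def pgc_sample(partitions, m_total, rng):
    """PGC idea: fit one copula per partition, sample proportionally to size."""
    sizes = np.array([len(p) for p in partitions], dtype=float)
    out = []
    for p, w in zip(partitions, sizes / sizes.sum()):
        corr, marg = fit_gaussian_copula(p)
        out.append(sample_gaussian_copula(corr, marg, max(1, int(round(w * m_total))), rng))
    return np.vstack(out)
```

        <p>Because each copula only has to model a locally homogeneous subset, the composition of these simple local models can approximate globally nonlinear structure.</p>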
        <p>Similarly, Conditional GANs (CGANs) benefit from label-based conditioning, but the efectiveness of
this conditioning depends on the granularity of labels. Coarse labels often miss intra-class variation,
so our framework introduces hierarchical conditioning, where data is either further clustered within
existing labels or pseudo-labeled when no explicit classes exist. This allows CGANs to learn fine-grained
conditional distributions, improving diversity and realism of generated data while also helping with
class imbalance.</p>
        <p>Tabular VAEs (TVAE), while designed for structured data, tend to average out distinct subpopulation
behaviors when trained globally. Partitioning the data first—via k-means or Gaussian Mixture
Models—allows training of one TVAE per cluster, preserving subgroup-specific dynamics. This ensures that
even rare or behaviorally distinct groups are represented accurately in the synthetic data.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Quality assessment check</title>
        <p>After generating synthetic data for each partition—where a separate model is trained per region—we
merge the outputs into a unified synthetic dataset. To ensure the overall quality, we apply a
post-generation quality assessment step. If the generated data fails to meet the desired quality thresholds,
we return to the partition selection phase and refine the partitioning strategy. The specific assessment
criteria heavily depend on the intended use case. In privacy-sensitive applications, metrics such as
Nearest-Neighbor Distance Ratio (NNDR) or Distance to Closest Record (DCR) can be used to evaluate
disclosure risks. These assess how distinguishable synthetic records are from real ones, helping identify
potential privacy leaks.</p>
        <p>When statistical fidelity is the primary concern, we can measure how well the synthetic data preserve
important distributional properties. Marginal distributions can be compared using the Wasserstein
distance or Jensen-Shannon divergence, while the correlation distance can be used to assess pairwise
dependencies. In use cases where synthetic data are intended to support downstream machine learning
tasks, model-centric metrics such as accuracy, F1 score, or AUC can be used. These evaluate whether
models trained on synthetic data generalize well to real data, serving as an indirect but practical measure
of utility.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Case study and experiments</title>
      <p>In this section, we present a comprehensive evaluation of our modular synthetic data generation
framework. The framework allows for flexible instantiation by combining a partitioning algorithm
with a generative model. In the experiments, a Hierarchical Variance-Entropy Based Partitioning
(HVEP) algorithm and Gaussian copulas have been adopted as the base generative model. We refer
to this instantiation as a Partitioned Gaussian Copula (PGC). For comparison, we also evaluated the
performance of a global Gaussian copula (GGC) model that applies the same generative model without
partitioning. In addition, we include other standard baselines to provide a broader context and assess
general performance across multiple axes.</p>
      <p>Our experiments evaluate the effectiveness of partition-based synthetic data generation using
Gaussian copulas on two fronts. First, we use geometric synthetic datasets with known nonlinear
patterns, such as spirals and trefoils, to visually assess how well different models preserve geometric
and topological structures. Second, we apply our approach to four real-world tabular datasets (Adult,
Loan, Breast Cancer, and Intrusion) to measure performance across machine learning utility (accuracy,
F1-score, AUC), statistical similarity (Wasserstein distance, Jensen-Shannon divergence, correlation
distance), and privacy metrics (Nearest Neighbor Distance Ratio and Distance to Closest Record). This
comprehensive evaluation reveals the strengths and limitations of our modular framework in capturing
complex dependencies while balancing utility, fidelity, and privacy. The synthetic datasets, notebooks,
source code, and all experiment artifacts are available on request.</p>
      <sec id="sec-4-1">
        <title>4.1. Geometric datasets for structural fidelity</title>
        <p>We used two synthetic 3D datasets, Spiral and Trefoil Knot, to evaluate structural fidelity. The Spiral
dataset consists of two intertwined spiral arms in three-dimensional space (x, y, z), forming a classic
nonlinear manifold that challenges models to preserve spatial continuity and curvature. The Trefoil Knot,
a single-loop knotted curve, introduces a topological challenge because of its non-trivial structure and
complex dependencies. Both datasets contain 10,000 samples and serve as visually intuitive benchmarks
for assessing a model's ability to capture nonlinear and topologically rich patterns.</p>
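        <p>The paper does not report the exact parametrizations of the two benchmarks; the sketch below generates comparable datasets, assuming a standard trefoil-knot curve and a two-arm helical spiral, each with small Gaussian jitter (the sample size matches the 10,000 used in the text).</p>

```python
import numpy as np

def make_trefoil(n=10_000, noise=0.02, seed=0):
    """Trefoil knot: a standard parametric curve plus Gaussian jitter."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0, 2 * np.pi, n)
    x = np.sin(t) + 2 * np.sin(2 * t)
    y = np.cos(t) - 2 * np.cos(2 * t)
    z = -np.sin(3 * t)
    return np.column_stack([x, y, z]) + rng.normal(0, noise, (n, 3))

def make_spiral(n=10_000, turns=3, noise=0.05, seed=0):
    """Two intertwined spiral arms, 180 degrees apart, rising along z."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0, turns * 2 * np.pi, n)
    arm = rng.integers(0, 2, n) * np.pi  # phase offset selects the arm
    x = t * np.cos(t + arm)
    y = t * np.sin(t + arm)
    z = t
    return np.column_stack([x, y, z]) + rng.normal(0, noise, (n, 3))
```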
        <p>To evaluate the fidelity of the generated data, we visualized the output of various models: PGC, GGC,
CTGAN, TVAE, and CopulaGAN, along with the original 3D data. All neural-based models were trained
for 150 epochs. A successful generative model is expected to preserve the global shape, continuity, and
density of the original data. Models that fail to do so often suffer from inadequate spatial modeling,
limited representational capacity, or overly simplified learning that neglects global structure.</p>
        <p>These qualitative observations are supported by quantitative metrics that help to assess how well a
model approximates the true data distribution. We report statistical measures including marginal and
correlation similarity, Jensen-Shannon divergence (JSD), and Kolmogorov-Smirnov (KS) divergence.
Together, these tools offer a comprehensive evaluation of structural fidelity in the generation of synthetic
data.</p>
        <p>4.1.1. Trefoil</p>
        <p>The results shown in Table 1 and Figure 2 indicate that the PGC achieves the lowest Wasserstein distance
and correlation distance, suggesting a superior preservation of both marginal distributions and
feature dependencies in the Trefoil dataset. The GGC and the neural-based generative models show higher
distances, reflecting a less accurate fit to the complex structure of the data. This highlights the effectiveness of
partitioning strategies in improving synthetic data quality for complex non-linear datasets.</p>
        <p>4.1.2. Spiral</p>
        <p>The spiral dataset results shown in Table 2 and Figure 3 confirm the PGC effectiveness, which achieves
the lowest Wasserstein distance and the best preservation of correlations, matching the GGC but
outperforming it in marginal distribution fidelity. Neural-based models such as CopulaGAN, CTGAN,
and TVAE exhibit higher correlation distances, indicating challenges in accurately capturing the complex
dependencies inherent in the spiral structure. In general, these findings underscore the benefits of
partitioning for modeling intricate non-linear data distributions.</p>
        <p>(Figure panels: (d) CTGAN; (e) TVAE; (f) CopulaGAN)</p>
        <p>4.2. Real-World Datasets for Utility, Privacy, and Statistical Similarity</p>
        <p>We evaluated our method on four real-world tabular datasets: UCI Adult, Breast Cancer Wisconsin,
Personal Loan, and KDD Cup 1999. These datasets differ in size and dimensionality, offering a broad
testbed for assessing generative performance. The UCI Adult dataset1 contains 48,842 records and
14 attributes, consisting of anonymized personal information such as occupation, age, native country,
race, capital gain, capital loss, education, work class, and more. The Breast Cancer Wisconsin dataset2
contains 569 rows and 30 features computed from a digitized image of a fine needle aspirate of a breast
mass, describing characteristics of the cell nuclei present in the image. The Personal Loan dataset3
includes 5,000 records and 12 attributes, including customer demographic information (age, income,
etc.), relationship with the bank (mortgage, securities account, etc.), and response to the last personal
loan campaign. The KDD Cup 1999 dataset4 comprises 50,000 records and 41 features describing
network traffic for intrusion detection. The number of training epochs is chosen based on the size and
complexity of each dataset, allowing fair comparison between models by ensuring sufficient training
without overfitting or underfitting.</p>
        <p>(Figure panels: (d) CTGAN; (e) TVAE; (f) CopulaGAN)</p>
        <p>1 https://www.kaggle.com/datasets/sagnikpatra/uci-adult-census-data-dataset
2 https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
3 https://www.kaggle.com/datasets/teertha/personal-loan-modeling</p>
        <p>For each dataset, we compare the Partitioned Gaussian Copula (PGC) against its counterpart without
partitioning (GGC). We also benchmark it against established baselines, including CTGAN, TVAE, and
CopulaGAN.</p>
        <p>4.2.1. Assessment metrics</p>
        <p>The evaluation is conducted on three dimensions: (1) machine learning (ML) utility, (2) statistical
similarity, and (3) privacy preservability. Each dimension offers insights into different aspects of data
quality and risk. We assess the machine learning utility of the synthetic data by comparing how well
it supports downstream model training in comparison to the real dataset. In short, we train standard
classifiers on both the real and synthetic training sets and evaluate them on a shared real test set. This
setup allows us to observe whether models trained on synthetic data can generalize similarly to those
trained on original data. Performance is reported using accuracy, F1-score, and AUC, enabling a fair
and interpretable comparison. To perform this evaluation, we first partition the original dataset into two
parts: training and testing. The training portion is then used as input to the synthetic data generation
model, which produces a dataset of equal size without exposing real records. We then train five
machine learning algorithms (decision tree, logistic regression, MLP, SVM, and random forest)
independently on both the real and synthetic training sets. In both cases, the evaluation is performed on the
same real test set. This setup provides direct insight into whether the knowledge learned from synthetic
data is transferable to real-world data distributions. If the performance of the synthetic-trained
models is close to that of the real-trained ones, the generated data are considered highly useful.</p>
        <p>4 https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html</p>
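        <p>The utility protocol above (train identical classifiers on real and on synthetic data, then evaluate both on the same real test split and compare scores) can be sketched as follows. The reduced classifier set, the 70/30 split, and accuracy as the single reported score are simplifications of this sketch, not the paper's exact configuration.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def utility_gap(X_real, y_real, X_syn, y_syn, seed=0):
    """Train the same classifiers on real and on synthetic data, evaluate
    both on a shared real test split, and report the accuracy differences
    (small gaps indicate that the synthetic data is useful for training)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_real, y_real, test_size=0.3, random_state=seed)
    factories = {
        "decision_tree": lambda: DecisionTreeClassifier(random_state=seed),
        "logistic_regression": lambda: LogisticRegression(max_iter=1000),
        "random_forest": lambda: RandomForestClassifier(random_state=seed),
    }
    gaps = {}
    for name, make in factories.items():
        real_model = make().fit(X_tr, y_tr)   # trained on real data
        syn_model = make().fit(X_syn, y_syn)  # trained on synthetic data
        gaps[name] = (accuracy_score(y_te, real_model.predict(X_te))
                      - accuracy_score(y_te, syn_model.predict(X_te)))
    return gaps
```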
        <p>To evaluate statistical similarity, we compute multiple distance-based metrics between the
synthetic and real data distributions. For continuous variables, we use the Wasserstein distance
to quantify how well the empirical distributions align. For categorical attributes, we apply the
Jensen-Shannon divergence, a symmetric and bounded variant of the KL divergence, which assesses the overlap
between probability distributions. Furthermore, we measure the correlation distance between
feature-pair relationships by computing the absolute differences between the two correlation matrices.</p>
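        <p>The three similarity metrics above can be computed directly with SciPy and NumPy. The sketch below is illustrative: the split of columns into continuous and categorical roles, and the averaging of per-column scores into single numbers, are assumptions of this example rather than details given in the text.</p>
        <preformat>
```python
# Wasserstein distance for continuous columns, Jensen-Shannon divergence for
# categorical columns, and the mean absolute difference between the two
# correlation matrices ("correlation distance").
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.spatial.distance import jensenshannon

def similarity_metrics(real, synth, cat_cols=()):
    """real, synth: 2-D numeric arrays with matching column semantics."""
    w, js = [], []
    for j in range(real.shape[1]):
        if j in cat_cols:
            # Estimate category frequencies over the shared support, then
            # take the JS divergence (jensenshannon returns the distance,
            # i.e. the square root of the divergence).
            cats = np.union1d(real[:, j], synth[:, j])
            p = np.array([(real[:, j] == c).mean() for c in cats])
            q = np.array([(synth[:, j] == c).mean() for c in cats])
            js.append(jensenshannon(p, q) ** 2)
        else:
            w.append(wasserstein_distance(real[:, j], synth[:, j]))
    corr_dist = np.mean(np.abs(np.corrcoef(real, rowvar=False)
                               - np.corrcoef(synth, rowvar=False)))
    return {"wasserstein": float(np.mean(w)) if w else None,
            "jsd": float(np.mean(js)) if js else None,
            "corr": float(corr_dist)}
```
        </preformat>
        <p>All three quantities are zero when the synthetic data reproduce the real distribution exactly, so lower values indicate higher statistical fidelity.</p>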
        <p>To ensure privacy is not compromised, we use two proximity-based metrics. The Nearest-Neighbor
Distance Ratio (NNDR) compares how close each synthetic point is to its closest real neighbor versus
its closest synthetic neighbor. Higher ratios suggest that synthetic points are more embedded within
the synthetic distribution, reducing the risk of memorization. The Distance to the Closest Record (DCR)
captures the minimum distance from each synthetic sample to any real data point, with higher values
indicating better privacy protection. Together, NNDR and DCR offer a robust assessment of whether the
synthetic data avoid replicating or leaking sensitive information from the original dataset.</p>
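        <p>Both proximity metrics reduce to nearest-neighbor queries. The sketch below is one plausible implementation using scikit-learn; reporting the medians, and the assumption that synthetic rows contain no exact duplicates, are choices of this example.</p>
        <preformat>
```python
# DCR: distance from each synthetic row to its closest real row.
# NNDR: that same distance divided by the distance to the closest *other*
# synthetic row. Higher medians for both suggest less memorization.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def privacy_metrics(real, synth):
    nn_real = NearestNeighbors(n_neighbors=1).fit(real)
    dcr = nn_real.kneighbors(synth)[0][:, 0]
    # Query the synthetic set against itself: neighbor 0 is the point
    # itself (distance 0), so neighbor 1 is the closest other synthetic row.
    nn_syn = NearestNeighbors(n_neighbors=2).fit(synth)
    d_syn = nn_syn.kneighbors(synth)[0][:, 1]
    nndr = dcr / np.maximum(d_syn, 1e-12)  # guard against duplicates
    return {"dcr_median": float(np.median(dcr)),
            "nndr_median": float(np.median(nndr))}
```
        </preformat>
        <p>A fully memorized generator (synthetic rows equal to real rows) would drive the DCR to zero, which is why smaller per-partition training subsets, as noted for PGC below, tend to depress these scores.</p>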
        <p>The experiment is run three times and the averages are reported, so that random fluctuations and
measurement noise do not skew the results.</p>
        <p>4.2.2. Results</p>
        <p>Table 3 shows the averaged differences in ML utility between models trained on real and on synthetic
data, in terms of accuracy, F1-score, and AUC; better synthetic data are expected to yield small
differences. The results confirm that partitioning enhances the utility of generative models. The PGC
model outperformed its global counterpart, the GGC, achieving an average gain of approximately +43
percentage points (pp) in F1. This substantial improvement highlights the strong compatibility of
Gaussian copulas with partitioning. PGC also outperformed the well-known GAN- and VAE-based
baselines, which shows how effective partitioning can be.</p>
        <p>Table 4 summarizes the statistical similarity metrics, averaged across all datasets. PGC
consistently achieves the best performance on all statistical metrics, indicating that partitioning
preserves the marginal and correlation structure most effectively in this setting.</p>
        <p>Table 5 presents the privacy evaluation metrics. These metrics assess how distinguishable synthetic
records are from real ones; higher values generally suggest better privacy. As expected, the experimental
results show that PGC generally offers weaker privacy protection than GGC: data partitioning leads to
smaller training subsets, which increases the risk that synthetic samples resemble real records too
closely.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This work addresses key challenges in synthetic data generation by proposing a modular framework
incorporating a preliminary data partitioning stage. This core contribution enhances generative model
capacity by allowing local modeling of homogeneous data subsets. The Partitioned Gaussian Copula
(PGC) instantiation, leveraging HVEP and Gaussian copulas, consistently demonstrated superior
machine learning utility and statistical fidelity on both complex geometric and diverse real-world tabular
datasets, significantly outperforming the Global Gaussian Copula (GGC) and established baselines. This
confirms that partitioning effectively enables generative models to capture intricate, nonlinear data
dependencies. While the localized modeling in PGC led to an expected, marginal trade-off in privacy
metrics compared to global approaches, the substantial gains in utility and fidelity underscore the
framework's effectiveness. Future research will integrate formal differential privacy mechanisms within
partitions and explore adaptive partitioning for various data modalities.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has also been supported by the European Union - Next Generation EU under the Italian
National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.3, project
3DSEECSDE, CUP J33C22002810001, partnership on “SEcurity and RIghts in the CyberSpace” (PE00000014
- program “SERICS”).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for grammar and spelling checking.
After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication content.</p>
    </sec>
  </body>
  <back>
  </back>
</article>