<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Empirical evaluation of amplifying privacy by subsampling for GANs to create differentially private synthetic tabular data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valtteri A. Nieminen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tapio Pahikkala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antti Airola</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Turku, Department of Computing</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Privacy concerns often limit the sharing of sensitive data collected from individuals. One proposed solution to make secondary use possible is privacy-preserving synthetic data that attempts to mimic real data. Due to their success on non-private tasks, GANs trained with differentially private stochastic gradient descent (DPSGD) have been popular for generating DP synthetic data. In recent years, a prominent approach to achieving better privacy guarantees has been to train ensembles of discriminator networks with DPSGD on mutually exclusive subsets, taking advantage of the synergy between GANs and privacy amplification by subsampling. However, this research has been done almost exclusively on images, and empirical evaluations of this strategy on other types of data are lacking. This work focuses on the effects of subsampling in creating DP synthetic tabular data with GANs. We evaluate synthetic data utility by training classification models on synthetic data and testing them on real data at varying subsampling rates. Further, we complement the evaluation with a qualitative examination of the generated data. Our findings show that while subsampling does bring benefits with tabular data in terms of the prediction performance of classifiers trained on synthetic data, the resulting samples can be very distorted compared to the original real data. The results suggest that the benefits obtainable via this method of training DP GANs can differ significantly based on the type of data used.</p>
      </abstract>
      <kwd-group>
<kwd>Machine Learning</kwd>
        <kwd>Differential Privacy</kwd>
        <kwd>GAN</kwd>
        <kwd>Synthetic Data</kwd>
        <kwd>Privacy Amplification by Subsampling</kwd>
        <kwd>Tabular Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Whether or not the calculation is perturbed, there is a tradeoff [5, 7] between privacy and utility. How good DP synthetic data is comes down to optimizing this tradeoff.</p>
      <p>Due to their success in non-private settings, generative adversarial networks (GAN) (see e.g. [8]), trained privately, have been among the most utilized model types to create DP synthetic data. A GAN comprises two types of neural networks, a generator G and a discriminator D, both initialized simultaneously and trained in tandem in an adversarial setup [9]. The goal is that G learns to produce samples similar to those of the real data distribution based on the feedback of D. This feedback concerns how well D can discern whether a sample was synthetic or came from the real data distribution, based on the approximation of the real data distribution that D has learned during training.</p>
      <p>DPSGD [6] was groundbreaking because it enabled the training of deep ML models with meaningful privacy guarantees. At the heart of why DPSGD made this possible is that it takes advantage of privacy amplification by subsampling (PABS) [6, 10, 5], a well-known result vital for modern privacy-preserving algorithms. PABS describes how privacy is amplified when an algorithm is run only on a subset of the whole data. In DPSGD, this result is applied to mini-batching, seeing each batch as a subset. Intuitively, amplification results from an adversary being unable to know which points were chosen for which iteration of the algorithm. The resulting improvement to privacy is very significant even for datasets of modest size and can, depending on the sampling strategy, be roughly as large as m/n, where m stands for the size of the subset and n for the size of the whole training data.</p>
      <p>While the first works on DP GANs took DPSGD and more or less simply combined it with a GAN to make the information on the sensitive, real data flowing to the discriminator private (see e.g. [11, 12]), soon methods taking advantage of the specifics of GANs were introduced. PATE-GAN [13] took the PATE [14] mechanism, a version of the subsample-and-aggregate framework of DP [5, 15], and used it to train a GAN model using the aggregated votes of an ensemble of multiple discriminators. Long et al. [16] took this approach further with G-PATE and noted that only the sensitive information flowing to the generator needs to be sanitized, as only the generator is released (after the network is trained, discriminators are not needed for generating synthetic data). This allowed for training a large ensemble of discriminators on exclusive subsets of data, taking advantage of the synergy between PABS and the GAN setup, where the discriminator and generator networks are separated.</p>
      <p>Unlike in G-PATE, where the information D provides to G is discretized to votes, the work on GS-WGAN by Chen et al. [17] worked the subsample-and-aggregate idea into DPSGD, improving on the results of previous works by aggregating the gradients of large discriminator ensembles (&gt; 1000) to train the generator. This application of subsample-and-aggregate [5, 15] to DPSGD results in only the updates to the G network costing privacy, while D can be trained without privacy costs, while also reaping the benefits of PABS by training multiple discriminators. This opens up many possibilities for further optimization: for example, the D networks can be pre-trained before updates to G, and G can be trained for more iterations. However, further research on this method has been lacking, with some claiming that the large number of networks trained hinders practical usability due to the time and resources taken to train so many networks [18].</p>
      <p>Most DP GAN synthetic data generation research has been conducted on image data. However, the overwhelming majority of, for example, health data is in tabular form. Here, tabular data refers to data where observations are on rows and columns represent the features of those observations, which may or may not be of mixed type. At the time, Chen et al. achieved state-of-the-art results using the handwritten digits MNIST [19] and Fashion-MNIST [20] image datasets. The absence of empirical results on tabular data leaves open interesting questions on data modality and the usability of this method to train DP GANs. As said, the approach requires dividing the data into many mutually exclusive subsets. Unlike tabular data, where features may or may not be correlated or important for some task at hand, the features of images are autocorrelated, as they depict parts of a whole.</p>
      <p>This paper presents an empirical investigation into generating DP tabular synthetic data using a GAN trained with the subsampled DPSGD strategy presented by Chen et al. [17]. We conduct experiments with tabular data using the freely available Cardio [21] dataset. The experiments include a standard downstream classification utility task in which classifiers are trained on synthetic and tested on real data. Unlike previous works, we focus on the effect of this training strategy in particular by varying the number of discriminators trained with mutually exclusive subsets across the experiments. The downstream classification utility experiment is augmented with a qualitative examination of the structure of the generated synthetic data. To the authors' knowledge, this work is the first to present an evaluation that focuses on subsampled DPSGD training of a DP GAN with tabular data rather than images.</p>
    </sec>
    <sec id="sec-p">
      <title>2. Preliminaries</title>
      <sec id="sec-p-1">
        <title>2.1. Differential privacy</title>
        <p>A randomized algorithm ℳ is (ε, δ)-differentially private if for adjacent datasets D1 and D2, meaning they differ in at most one record, and for all measurable sets S of outputs, the following inequality holds:</p>
        <p>Pr[ℳ(D1) ∈ S] ≤ e^ε · Pr[ℳ(D2) ∈ S] + δ. (1)</p>
<p>Here ℳ is, for example, one iteration of DPSGD training, ε is the upper bound for the privacy loss, and δ is a small probability of a catastrophic breach of the DP guarantee [5]. A smaller ε stands for a stronger guarantee.</p>
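        <p>To make inequality (1) concrete, the following is a minimal, illustrative Python sketch (ours, not part of the paper) that empirically checks the bound for the classic randomized-response mechanism, whose privacy loss with a fair coin is known to be ε = ln(3) with δ = 0:</p>
        <preformat>import math
import random

def randomized_response(true_bit: int) -&gt; int:
    # With probability 1/2 report the truth, otherwise report a fair coin flip.
    if random.random() &lt; 0.5:
        return true_bit
    return random.randint(0, 1)

def estimate_output_prob(true_bit: int, trials: int = 200_000) -&gt; float:
    hits = sum(randomized_response(true_bit) for _ in range(trials))
    return hits / trials

# Adjacent "datasets": a single record that differs (1 vs. 0).
p1 = estimate_output_prob(1)   # Pr[M(D1) = 1], about 0.75
p2 = estimate_output_prob(0)   # Pr[M(D2) = 1], about 0.25
eps = math.log(3)              # known privacy loss of this mechanism
print(p1 &lt;= math.exp(eps) * p2)  # inequality (1) with delta = 0</preformat>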
<p>Although a single acceptable value for the privacy budget ε cannot be given, as it depends on the context, in the literature values of ε ≤ 1 have been seen as very strong protection [22], and, depending on the type of data and task, values of ε ≤ 10 have been seen to still result in meaningful guarantees [6]. Informally, ε can be said to depict the worst case of how much information that cannot be learned from other individuals' data can be learned from the output concerning a specific individual.</p>
<p>The model in this work uses Rényi differential privacy (RDP) [23], another formulation of DP often used with DP deep learning models to get tighter bounds when composing DP guarantees over iterations. In this paper, due to interpretability, Rényi DP bounds are converted to (ε, δ). The privacy loss of training is tracked via the subsampled Rényi moments accountant [24]. There exist many ways to compose the privacy costs of sequential runs of a DP algorithm. Naively, this is a summation, but by using advanced techniques, a more efficient composition can be achieved.</p>
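        <p>As an illustration of the RDP-to-(ε, δ) conversion mentioned above, the sketch below applies the standard conversion from [23], ε = ε_RDP(α) + log(1/δ)/(α − 1), minimized over a grid of orders α. The Gaussian-mechanism RDP curve ε_RDP(α) = α/(2σ²) is used here as a simple stand-in for the output of the subsampled accountant [24], which is tighter in practice:</p>
        <preformat>import math

def rdp_gaussian(alpha: float, sigma: float) -&gt; float:
    # RDP curve of the (non-subsampled) Gaussian mechanism [23]: alpha / (2 sigma^2).
    return alpha / (2.0 * sigma ** 2)

def rdp_to_eps_delta(sigma: float, delta: float, steps: int) -&gt; float:
    # RDP composes additively over iterations; then convert to (eps, delta) with
    # eps = rdp(alpha) + log(1/delta) / (alpha - 1), minimized over alpha.
    best = float("inf")
    for alpha in [1.25, 1.5, 2, 4, 8, 16, 32, 64, 128]:
        total_rdp = steps * rdp_gaussian(alpha, sigma)
        eps = total_rdp + math.log(1.0 / delta) / (alpha - 1.0)
        best = min(best, eps)
    return best

print(rdp_to_eps_delta(sigma=5.0, delta=1e-5, steps=1000))</preformat>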
<p>This work uses differentially private stochastic gradient descent (DPSGD) [6] to optimize the DP GAN. DPSGD differs from its non-private counterpart in that, prior to updating the model parameters, the maximum influence an individual data point can have on the output, called the sensitivity [5] of the function, is bounded by clipping gradients. Clipping is followed by adding noise from a noise mechanism. Noise mechanisms, like the Gaussian mechanism used in this work, are functions from which noise calibrated to a specific sensitivity can be sampled [25]. The choice of noise mechanism largely depends on the type of information sanitized.</p>
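      <p>The classic calibration of the Gaussian mechanism gives a concrete example of fitting noise to a sensitivity bound: for a function with ℓ2-sensitivity Δ, noise with standard deviation σ = Δ·sqrt(2 ln(1.25/δ))/ε satisfies (ε, δ)-DP for ε below 1 (see e.g. [5]). A hypothetical sketch:</p>
      <preformat>import math
import random

def gaussian_mechanism(value: float, sensitivity: float, eps: float, delta: float) -&gt; float:
    # Classic calibration for the Gaussian mechanism (valid for eps &lt; 1), see [5]:
    # sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / eps.
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps
    return value + random.gauss(0.0, sigma)

# Example: privatize a mean of values clipped to [0, 1] over n = 1000 records,
# so the sensitivity of the mean is 1/n.
print(gaussian_mechanism(0.42, sensitivity=1.0 / 1000, eps=0.5, delta=1e-5))</preformat>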
      <p>A DPSGD training step, that is, one run of the DP optimization algorithm, can be summarized as follows:</p>
      <p>1. Gradients before sanitation are calculated with backpropagation as in non-DP SGD. At training step t these are ∇ℒ_t = ∇_θ ℒ(θ_t^D, θ_t^G), where θ_t^D and θ_t^G are the discriminator and generator networks' weights.</p>
      <p>2. Gradient information is sanitized by bounding the sensitivity, clipping the gradient vectors to a maximum norm of C, and adding noise: ∇̂ℒ_t = ℳ_{σ,C}(∇ℒ_t) = clip(∇ℒ_t, C) + 𝒩(0, σ²C²).</p>
      <p>3. The parameters of the model are updated using the sanitized gradients as in normal gradient descent: θ^(t+1) := θ^(t) − η · ∇̂ℒ_t, where η is the learning rate.</p>
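      <p>The three steps above can be condensed into an illustrative PyTorch sketch. This is our own naive reconstruction, not the GS-WGAN code; a practical implementation would use dedicated per-sample gradient machinery (e.g. the Opacus library) for efficiency:</p>
      <preformat>import torch

def dpsgd_step(model, loss_fn, batch_x, batch_y, lr=0.1, C=1.0, sigma=1.0):
    """One illustrative DPSGD step: clip per-example gradients, add noise, update."""
    params = list(model.parameters())
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(batch_x, batch_y):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()  # step 1: per-example gradient via backpropagation
        # Step 2a: clip this example's gradient to l2-norm at most C.
        norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
        scale = min(1.0, C / (norm.item() + 1e-12))
        for s, p in zip(summed, params):
            s += p.grad * scale
    n = len(batch_x)
    with torch.no_grad():
        for s, p in zip(summed, params):
            # Step 2b: add Gaussian noise N(0, sigma^2 C^2), then average.
            noisy = (s + sigma * C * torch.randn_like(s)) / n
            # Step 3: standard gradient-descent update with the sanitized gradient.
            p -= lr * noisy</preformat>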
        <sec id="sec-1-1-1">
          <title>2.2. DP synthetic data</title>
<p>DP synthetic data generation is possible due to the post-processing property [5] of differential privacy, which guarantees that the outputs, in this case synthetic data, of any process that is DP are also DP. Importantly, the DP guarantees are not actually over the synthetic data but over the algorithm that generated it.</p>
<p>Evaluation of DP synthetic data can be said [1] to lie on three axes: privacy, utility, and fidelity (also sometimes called sample quality). Utility is simply the usefulness of the synthetic data for a given task. In this work, the downstream classification task, where a downstream model, a model trained on synthetic data, is evaluated on real data, is concerned with utility. Fidelity refers to how closely the statistical properties of the synthetic data are preserved. What exactly this means differs by the measure used, but often, for example, correlational structure and distributions are compared between synthetic and real data.</p>
      </sec>
      <sec id="sec-p-3">
        <title>2.3. GAN</title>
        <p>Generative Adversarial Networks [9] are a type of generative model where training is formulated as a competitive game between two networks: a generator G and a discriminator D. The goal is that G learns a mapping from some bounded domain, usually a noise vector denoted with z, to an approximation of the distribution of real data based on the D-network's feedback.</p>
        <p>The G can be used to generate samples from the distribution it has learned by feeding z to the network as input: G(z). D discriminates between this generated output G(z) and real data. GANs have been thought to have some inherent privacy attributes, such as resistance to overfitting [8], because G only interacts with the real data indirectly by receiving information from D, and it does not define an explicit density function but learns an approximation during training.</p>
        <p>Few works on these inherent properties exist, but even non-DP GANs have been shown to provide some weak protection against membership inference attacks [26].</p>
        <p>The model used in this work is based on a Wasserstein GAN with gradient penalty (WGAN-GP) [27]. In the context of Wasserstein GAN [27], the discriminator is called a critic, but in this work, it is referred to as a discriminator as well to avoid extra terminology.</p>
      <sec id="sec-1-5">
        <title>The choice of the Wasserstein loss and use of gradient</title>
        <p>penalty [27] is non-trivial as it has privacy-synergies
with DPSGD clipping [17]. The Wasserstein loss is based
on the Earth Mover’s distance (EMD). For EMD to be
valid, the 1-Lipschitz continuity condition must hold (see
Definition 1).</p>
        <p>R
Definition 1 (Lipschitz-continuity). A function f :
R → R is globally L-Lipschitz continuous if there exists
an L ≥ 0 such that ‖ () −  ()‖ ≤ ‖ − ‖ ∀,  ∈</p>
        <p>If the continuity holds, gradient magnitudes during
training are approximately between [− 1, 1] [17]. The
gradient penalty regularization term [27] is used as a soft
restraint to make the condition hold. This is beneficial
for DPSGD training, as then setting the clipping bound
 = 1 should be a close to optimal choice and a costly
search for the hyperparameter value is avoided [17].
Definition 2 (Wasserstein-1 loss of D and G [27]).
ℒ = − E∼ ^ [()] + E˜∼  [(˜)] + GP
ℒ = − E∼  [(())]</p>
        <p>Where  is the noise sampled from a normal
distribution given as input to G to generate samples, 
is the regularization strength hyperparameter of the
gradient-penalty term,  is the real data and ˜ is the
data generated by G. GP is the gradient penalty term
 E ︀[ (‖∇(  + (1 −  )˜)‖2 − 1)2]︀ and  ∼  [0, 1]
is the interpolation coeficient and ∇ [27].</p>
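        <p>Definition 2 translates directly into code. The sketch below is a minimal, illustrative PyTorch version of the two losses, assuming 2-D (batch × features) inputs; the names are ours, not those of the paper's implementation:</p>
        <preformat>import torch

def d_loss_wgan_gp(D, real, fake, lam=10.0):
    """Wasserstein-1 discriminator loss with gradient penalty (Definition 2)."""
    # Interpolate between real and generated samples with alpha ~ U[0, 1].
    alpha = torch.rand(real.size(0), 1)
    interp = (alpha * real + (1.0 - alpha) * fake).requires_grad_(True)
    d_interp = D(interp)
    grads, = torch.autograd.grad(
        outputs=d_interp, inputs=interp,
        grad_outputs=torch.ones_like(d_interp), create_graph=True)
    gp = ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
    return -D(real).mean() + D(fake).mean() + lam * gp

def g_loss_wgan(D, G, z):
    """Generator loss: L_G = -E[D(G(z))]."""
    return -D(G(z)).mean()</preformat>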
        <sec id="sec-1-5-1">
          <title>2.4. Subsample-and-aggregate and privacy amplification</title>
          <p>The work of Chen et al., [17] can be seen to be a
successor to a line of works, especially the G-PATE [16], which
adapts the subsample-and-aggregate [5] framework of
DP, first formalized by Nissim et al. [ 15], to DPSGD and
training of multiple discriminators on a GAN setup to
reap privacy amplification by subsampling (PABS)
beneifts.</p>
        </sec>
      </sec>
      <sec id="sec-1-6">
        <title>Privacy amplification by subsampling is a well-studied</title>
        <p>subject with bounds for diferent sampling strategies,
such as without replacement or with replacement having
been worked out extensively, especially in the works of
Kasiviswanathan et al., [28] and Balle et al., [10]. PABS
is induced in the model of this work by training a large
number of  networks on mutually exclusive subsets
and randomly querying them at each  update step. This
corresponds to PABS for sampling without replacement
[10], with an amplification efect roughly proportional to
 , where  is the number of mutually exclusive subsets

the data is split into and  the size of the whole training
data.</p>
        <p>Figure 1 depicts the sanitation of gradients [17] during
the update steps of the generator. As seen from the figure,
where the sanitation bound, or "privacy barrier" as called
by [17] is placed "between" the two networks. This is an
important emphasis, because, it is what allows training
the  networks without incurring privacy costs. If the
sanitation would be between, for example, the real data
and the  networks, every time they see real data would
result in a privacy cost.</p>
      </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Materials and methods</title>
      <sec id="sec-2-1">
        <title>3.1. Model specifications</title>
<p>The freely available code [29] of GS-WGAN by Chen et al. [17] was used as the basis of the implementation, but the architecture and gradient clipping procedure code were re-implemented to fit the tabular data case, replacing the convolutional architecture with a fully connected one using PyTorch (v. 1.4.0) [30]. Choices for the new architecture specifications, such as the width and depth of the network and the number of D training repetitions per generator iteration, were made based on a limited number of tests with fewer than five training runs with different seeds per choice. Unless mentioned here, hyperparameter choices were those recommended by [27].</p>
<p>The G network used was a fully connected network with two hidden layers, the largest being of size 256, and 16 outputs. The size of the noise vector z was set to 32, based on experimenting with a few usual settings. The activation function used was ReLU [31], except in the last layer of G, where a hyperbolic tangent (TanH) was used, due to the range of the feature values and of the Wasserstein loss function (both [-1, 1]). The D classifier network was a typical multilayer perceptron architecture with one hidden layer of size 128. Instead of ReLU, as in the G network, a LeakyReLU [32] with the negative-slope value set to 0.2 was used in the hidden layers, as in [27].</p>
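        <p>A minimal PyTorch sketch of fully connected networks matching the stated specifications; any layer size not quoted above (here, the generator's first hidden layer) is our assumption:</p>
        <preformat>import torch.nn as nn

# Generator: two hidden layers (largest 256), noise dim 32, 16 outputs,
# ReLU activations and TanH output to match the [-1, 1] feature range.
# The size of the first hidden layer (128) is our guess.
generator = nn.Sequential(
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 16), nn.Tanh(),
)

# Discriminator (critic): one hidden layer of size 128 with LeakyReLU(0.2);
# a single linear output, as is usual for Wasserstein critics.
discriminator = nn.Sequential(
    nn.Linear(16, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1),
)</preformat>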
<p>The publicly available Cardiovascular Disease dataset [21] consists of 70 000 observations and 12 features: five binary, two categorical, and five numeric. The classification target is to predict the presence of cardiovascular disease (the feature 'cardio'). Table 1 lists all features and their types. This dataset was chosen for reproducibility and to work as a feasible proxy for common tabular EHR data; the condition to predict is common, and the features are routinely collected during doctor's examinations. The number of patients with cardiovascular disease in the dataset is balanced. Blood pressure values were limited to a range of ±20 from values indicating a hypertensive crisis according to Finnish national standards [33], affecting 1064 values of ap_hi and 312 values of ap_lo. KNN imputation (see [34]) with k = 3 was carried out to replace these, using the implementation from the scikit-learn package [35] (version 1.0.2). The two categorical variables 'cholesterol' and 'glucose' were one-hot encoded, after which all features were min-maxed to [-1, 1] to match the feature value range of the G network output layer.</p>
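        <p>The preprocessing described above can be sketched with scikit-learn roughly as follows. The column names follow the Kaggle dataset; the out-of-range bounds are illustrative placeholders, as the paper derives them from hypertensive-crisis values ±20 [33]:</p>
        <preformat>import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("cardio_train.csv", sep=";")

# Mask blood-pressure values outside the accepted range (bounds here are
# hypothetical; the paper's are based on Finnish national standards [33]).
for col, lo, hi in [("ap_hi", 60, 200), ("ap_lo", 40, 140)]:
    df.loc[(df[col] &lt; lo) | (df[col] &gt; hi), col] = np.nan

# KNN imputation with k = 3, as in the paper.
df[df.columns] = KNNImputer(n_neighbors=3).fit_transform(df)

# One-hot encode the two categorical features ...
df = pd.get_dummies(df, columns=["cholesterol", "gluc"])

# ... and min-max all features to [-1, 1] to match the G output layer.
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(df)</preformat>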
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments and evaluation</title>
<p>The quality of the synthetic data is evaluated from two viewpoints: downstream classification utility and sample fidelity. This section gives an overview of the downstream utility experiment conducted and the DP synthetic data generation process. The downstream classification utility experiment is depicted in Algorithm 1.</p>
      <sec id="sec-3-1">
        <title>4.1. Downstream utility experiment</title>
<p>Downstream classification utility is a standard way of evaluating DP synthetic data and the method used to generate it (see e.g. [36, 17, 37]). In this experiment, synthetic data is used to train a downstream model, which is tested against real data. In this work's binary classification task, a logistic regression (LR) classifier from the scikit-learn [35] (version 1.0.2) package was used as the model of choice, and classification performance was measured with the AUC metric [38].</p>
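        <p>The core of this train-on-synthetic, test-on-real evaluation amounts to a few lines; the sketch below is our simplification, while the actual experiment adds checkpointing and hyperparameter search (see Algorithm 1):</p>
        <preformat>from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def downstream_auc(X_syn, y_syn, X_real_test, y_real_test):
    """Train a classifier on synthetic data, score it with AUC on real data."""
    clf = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
    scores = clf.predict_proba(X_real_test)[:, 1]
    return roc_auc_score(y_real_test, scores)</preformat>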
<p>Five private generators were trained for the downstream classification task, each up to a maximum of 40 000 iterations, using a number of pre-trained D networks corresponding to the subsampling rates 1/250, 1/500, 1/750, 1/1000, and 1/1500, that is, the fraction of real data in each mutually exclusive subset. In addition, a non-private G network was trained to compare the effects of the generating process only. In all cases, the discriminators D were pre-trained for 2000 iterations.</p>
<p>Every model was saved once per 1000 iterations, referred to in this work as checkpoints. Each of these saved states of the model was evaluated separately. Hyperparameter optimization over the choice of regularization term and regularization strength was conducted for the logistic regression classifier separately for each checkpoint at the different iterations. The real and synthetic data used for this experiment were split into training, validation, and test sets with sizes corresponding to fractions (0.8/0.1/0.1) of the real dataset. The resulting set sizes were 56 000 for the synthetic training set, 7000 for the synthetic validation set, and 7000 for the real test set.</p>
<p>A total of 287 full model selection runs (40 checkpoints at each of the 6 different subsampling settings and the additional real-versus-real baseline case), consisting of hyperparameter optimization and evaluation with the best hyperparameter settings, were conducted.</p>
</sec>
      <sec id="sec-3-2">
        <title>4.2. Sample fidelity</title>
        <p>A comparative assessment of the effects of different privacy and subsampling levels on the method's ability to retain sample fidelity consists in this work of three comparisons: correlational structure using Spearman's rank correlation coefficient, a visual examination of the change in the continuous feature distributions, and a visual examination of the change in the binary and categorical variable distributions.</p>
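        <p>A compact sketch of the correlational-structure comparison (our own helper, not the paper's code):</p>
        <preformat>import numpy as np
from scipy.stats import spearmanr

def correlation_gap(real: np.ndarray, synthetic: np.ndarray):
    """Compare Spearman rank-correlation matrices of the continuous features."""
    rho_real, p_real = spearmanr(real)        # correlates columns pairwise
    rho_syn, p_syn = spearmanr(synthetic)
    # Largest absolute change in any pairwise correlation, plus both matrices.
    return np.abs(rho_real - rho_syn).max(), (rho_real, rho_syn)</preformat>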
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-1">
        <title>5.1. Downstream classification utility</title>
<p>Figure 2 compares the results of the downstream classification utility experiment (train on synthetic, test on real data) across the subsampling rates, with the private models compared to a non-private generator denoted NO_DP.</p>
        <p>Algorithm 1: Downstream Classification Quality Experiment
1 Create sets h ∈ H, where H denotes combinations between 20 values of the regularization strength randomly sampled from a logarithmic space between (0, 1) and a regularization term choice of either L1 or L2
2 for subsampling rate r in {1, 1/250, 1/500, 1/750, 1/1000, 1/1500} do
3   pretrain k (the denominator of r) networks D
4   train G using D; save "checkpoint models" G1, G2, ..., G40 every 1000 iterations
5 for r in {1, 1/250, 1/500, 1/750, 1/1000, 1/1500} do
6   for i in {1k, 2k, ..., 40k} do
7     split dataset X of size n with stratification by the Y variable 'cardio' into Xtrain, Xval, and Xtest with proportions 0.8/0.1/0.1 of n
8     sample s = 0.9 · n synthetic data points from Gi and split with stratification into Strain and Sval
9     for h ∈ H do
10      train classifier C with hyperparameters h and data Strain
11      evaluate C against Sval
12      save best h in hbest
13    train classifier C with hbest and data Strain combined with Sval
14    evaluate C using Xtest
15    save the result of the downstream classification utility test for Gi
16    empty the set hbest</p>
        <p>[Figure 2: Downstream classification utility (AUC) at subsampling rates up to 1/1500, compared to a non-private generator denoted NO_DP; the AUC with real data (0.795) is shown for reference. The checkpoint with the best of the evaluations for a specific model is chosen as the "best model". The visualizations of distributions in Figure 3 are from data generated by these "best" models.]</p>
        <p>The best overall results in terms of the ε-AUC tradeoff were reached by M1/1500, compared here with data sampled from M1/1000, which ultimately failed to reach the same AUC, its highest score being 0.687 at ε = 14.7.</p>
        <p>In general, models with weaker privacy guarantees and a smaller subsampling rate were able to reach higher values of AUC eventually, but at high privacy costs. In comparison to M1/1500, which reached the best tradeoff, M1/500, for example, reached the value 0.717, close to that of M1/1500, at ε = 34.8, nearly six times more. The best AUC value obtained with the privacy-preserving models was reached by M1/250 at an AUC of 0.752 with ε = 63.0.</p>
</sec>
      <sec id="sec-4-2">
        <title>5.2. Sample fidelity</title>
        <p>Figure 3 shows distributions of the continuous features generated by the models performing best in the downstream classification task at settings M1/250, M1/750, and M1/1500, as well as by the non-private NO_DP, all compared with the real data feature distributions. Note that the y-axis density value range varies to provide better resolution for each variable. Interestingly, there is a visible x-axis shift, especially in the samples from models where the number of subsets is larger.</p>
        <p>[Figure 3: Density plots of the continuous features for real data and for data sampled from the models 1/1500, 1/1000, 1/750, 1/500, 1/250, and NO_DP.]</p>
        <p>Figure 4 compares the binary and categorical feature distributions of data sampled from the DP models, the baseline model NO_DP, and real data. NO_DP appears to capture the distributions of the real data well, but when DP is applied there are considerable deviations from the real data case, especially with M1/1500 with stricter guarantees (ε = 6). In the case of features where the number of positive cases is low to begin with, such as 'alcohol', adding DP seems to often further decrease the amount of positives. For the categorical features 'cholesterol' and 'glucose', stronger privacy guarantees, such as in the case of M1/1500, seem to also balance the size differences between the counts.</p>
        <p>[Figure 4: Proportions of counts of the binary and categorical features Smoking, Alcohol, Active, Cardio, Cholesterol, and Glucose for real data and for data sampled from the models 1/1500, 1/1000, 1/750, 1/500, 1/250, and NO_DP.]</p>
        <p>Figure 5 shows a comparison of Spearman rank correlation coefficient values calculated between the continuous features across synthetic data sampled from the best-performing models. Significant correlations are marked with (*) for a significance level of p &lt; 0.05 and (**) for p &lt; 0.01. Even in the case of the non-private synthetic data sampled from NO_DP, many of the dependencies in the real data are lost, as is the case with, for example, the correlation between 'weight' and 'ap_hi'. In addition, the synthetic datasets, especially those sampled from the private models, show new correlations that are not present in the real data.</p>
        <p>[Figure 5: Spearman rank correlation coefficients of the continuous variables (Age, Height, Weight, Ap_lo, Ap_hi) compared between the synthetic and real datasets across the best-performing models in the downstream classification utility experiment (Mbaseline, AUC = 0.788; M1/250, ε = 63, AUC = 0.752; M1/750, ε = 18, AUC = 0.751; M1/1500, ε = 6, AUC = 0.717). The synthetic data generated by the non-private model as well as the real data case are included for comparison. One asterisk (*) denotes significance at a p-value &lt; 0.05 and (**) marks significance at level &lt; 0.01.]</p>
</sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>In the downstream classification utility task (see Figure 2), the private models, especially those with higher subsampling rates, required more training iterations to obtain high AUC values. However, the benefits to privacy attained with the subsample-and-aggregate DPSGD strategy outweighed these costs. This result suggests that the benefits seen with PABS and image data by Chen et al. [17] can also be reaped in terms of downstream tasks when tabular data with mixed features is used. However, the samples generated by models at stricter privacy guarantees have substantial deviations in all the fidelity examinations when compared to real data. This is especially evident in the binary feature distributions when there are few positive observations to begin with, which corresponds with other results showing that DPSGD training affects imbalanced distributions disproportionately [39]. In addition, some correlations turn from non-significant to significant.</p>
      <p>The described effect worsens as the subsampling rate rises, but it does not affect the downstream classification metric nearly as much. This could be indicative of the differences between tabular and image data hypothesized earlier: tabular data features may or may not be correlated or important for the task at hand, while the features of images are autocorrelated, as they depict parts of a whole. Tabular data may lose almost all signal in some features, while with images, the perturbation is applied more evenly due to autocorrelation and less imbalance in the distributions.</p>
      <p>A clear limitation of this paper is that it only uses one dataset. This is due to the significant computational expense: the time it took to train a whole model exceeded 15 hours for the model with the highest subsampling rate 1/1500, not counting the hyperparameter optimization, using three NVIDIA RTX Titan GPUs. Subsample-and-aggregate-based methods have been criticized for being computationally expensive [18]. However, this question is not as black and white because, although expensive, subsampling could incur great privacy benefits for some types of data while having only a small detrimental effect on model performance.</p>
      <p>Compared with other works using the same data, Fang et al. [40] reported better results. However, it has since been noted that their approach of adding DP to the conditional GAN of [41] is not DP, since it oversamples the data and the DP mechanism is not random. RDP-CGAN by Torfi et al. [36] reported results visually in a figure of approximately AUC = 0.72 at ε = 10, which falls short of the results of this work.</p>
      <p>The results of this study suggest that subsample-and-aggregate DPSGD training also brings benefits with tabular data, however, with a higher cost to fidelity than with images. From a broader perspective, this work adds to the line of thought [1] that useful DP synthetic data can be made specifically for some problems, but making "general" synthetic data, where all features would be preserved well and which could be used like real data, is very difficult if not impossible.</p>
      <p>Future work could take a closer look at how the benefits from subsampling and data structure relate, using, for example, simulated data to control more parameters. Another direction relates to taking advantage of the free training of the discriminators. For example, ways to track when generator training steps would be optimal with this method could result in significant benefits.</p>
    </sec>
    <sec id="sec-6">
      <title>7. Acknowledgements</title>
      <p>This work has been conducted as part of the PRIVASA project funded by Business Finland (grant number 37428/31/2020).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[1] J. Jordon, L. Szpruch, F. Houssiau, M. Bottarelli, G. Cherubin, C. Maple, S. N. Cohen, A. Weller, Synthetic data - what, why and how?, 2022. URL: https://arxiv.org/abs/2205.03257. doi:10.48550/ARXIV.2205.03257.
[2] A. Wood, M. Altman, A. Bembenek, M. Bun, M. Gaboardi, J. Honaker, K. Nissim, D. R. O'Brien, T. Steinke, S. Vadhan, Differential privacy: A primer for a non-technical audience, Vand. J. Ent. &amp; Tech. L. 21 (2018) 209.
[3] J. Hayes, L. Melis, G. Danezis, E. De Cristofaro, Logan: Membership inference attacks against generative models, arXiv preprint arXiv:1705.07663 (2017).
[4] D. Chen, N. Yu, Y. Zhang, M. Fritz, Gan-leaks: A taxonomy of membership inference attacks against generative models, in: Proceedings of the 2020 ACM SIGSAC conference on computer and communications security, 2020, pp. 343-362.
[5] C. Dwork, A. Roth, et al., The algorithmic foundations of differential privacy, Foundations and Trends in Theoretical Computer Science 9 (2014) 211-407.
[6] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, L. Zhang, Deep learning with differential privacy, in: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, 2016, pp. 308-318.
[7] T. Stadler, B. Oprisanu, C. Troncoso, Synthetic data - anonymisation groundhog day, in: 31st USENIX Security Symposium (USENIX Security 22), USENIX Association, Boston, MA, 2022, pp. 1451-1468. URL: https://www.usenix.org/conference/usenixsecurity22/presentation/stadler.
[8] I. Goodfellow, Nips 2016 tutorial: Generative adversarial networks, arXiv preprint arXiv:1701.00160 (2016).
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Advances in neural information processing systems 27 (2014).
[10] B. Balle, G. Barthe, M. Gaboardi, Privacy amplification by subsampling: Tight analyses via couplings and divergences, Advances in Neural Information Processing Systems 31 (2018).
[11] L. Xie, K. Lin, S. Wang, F. Wang, J. Zhou, Differentially private generative adversarial network, arXiv preprint arXiv:1802.06739 (2018).
[12] B. K. Beaulieu-Jones, Z. S. Wu, C. Williams, R. Lee, S. P. Bhavnani, J. B. Byrd, C. S. Greene, Privacy-preserving generative deep neural networks support clinical data sharing, Circulation: Cardiovascular Quality and Outcomes 12 (2019).
[13] J. Jordon, J. Yoon, M. Van Der Schaar, Pate-gan: Generating synthetic data with differential privacy guarantees, in: International conference on learning representations, 2019.
[14] N. Papernot, S. Song, I. Mironov, A. Raghunathan, K. Talwar, Ú. Erlingsson, Scalable private learning with PATE, arXiv preprint arXiv:1802.08908 (2018).
[15] K. Nissim, S. Raskhodnikova, A. Smith, Smooth sensitivity and sampling in private data analysis, in: Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, 2007, pp. 75-84.
[16] Y. Long, S. Lin, Z. Yang, C. A. Gunter, B. Li, Scalable differentially private generative student model via PATE, arXiv preprint arXiv:1906.09338 (2019).
[17] D. Chen, T. Orekondy, M. Fritz, GS-WGAN: A gradient-sanitized approach for learning differentially private generators, Advances in Neural Information Processing Systems 33 (2020) 12673-12684.
[18] T. Cao, A. Bie, A. Vahdat, S. Fidler, K. Kreis, Don't generate me: Training differentially private generative models with sinkhorn divergence, Advances in Neural Information Processing Systems 34 (2021) 12480-12492.
[19] L. Deng, The MNIST database of handwritten digit images for machine learning research, IEEE Signal Processing Magazine 29 (2012) 141-142.
[20] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747 (2017).
[21] S. Ulianova, Cardiovascular disease dataset, 2019. URL: https://www.kaggle.com/sulianova/.
[22] C. Arnold, M. Neunhoefer, Really useful synthetic data - A framework to evaluate the quality of differentially private synthetic data, arXiv preprint arXiv:2004.07740 (2020).
[23] I. Mironov, Rényi differential privacy, in: 2017 IEEE 30th computer security foundations symposium (CSF), IEEE, 2017, pp. 263-275.
[24] Y.-X. Wang, B. Balle, S. P. Kasiviswanathan, Subsampled Rényi differential privacy and analytical moments accountant, in: The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, 2019, pp. 1226-1235.
[25] C. Dwork, F. McSherry, K. Nissim, A. Smith, Calibrating noise to sensitivity in private data analysis, in: Theory of cryptography conference, Springer, 2006, pp. 265-284.
[26] Z. Lin, V. Sekar, G. Fanti, On the privacy properties of gan-generated samples, in: International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 1522-1530.
[27] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. C. Courville, Improved training of Wasserstein GANs, Advances in neural information processing systems 30 (2017).
[28] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, A. Smith, What can we learn privately?, SIAM Journal on Computing 40 (2011) 793-826.
[29] D. Chen, GS-WGAN github-repository, https://github.com/DingfanChen/GS-WGAN, 2020.
[30] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024-8035.
[31] K. Jarrett, K. Kavukcuoglu, M. Ranzato, Y. LeCun, What is the best multi-stage architecture for object recognition?, in: 2009 IEEE 12th international conference on computer vision, IEEE, 2009, pp. 2146-2153.
[32] A. L. Maas, A. Y. Hannun, A. Y. Ng, et al., Rectifier nonlinearities improve neural network acoustic models, in: Proc. icml, volume 30, Citeseer, 2013, p. 3.
[33] The Finnish Medical Society Duodecim, Current Care Guidelines: Treatment of hypertensive crisis, 2020. URL: https://www.kaypahoito.fi/hoi04010.
[34] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R. B. Altman, Missing value estimation methods for DNA microarrays, Bioinformatics 17 (2001) 520-525.
[35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825-2830.
[36] A. Torfi, E. A. Fox, C. K. Reddy, Differentially private synthetic medical data generation using convolutional GANs, Information Sciences 586 (2022) 485-500.
[37] Y. Tao, R. McKenna, M. Hay, A. Machanavajjhala, G. Miklau, Benchmarking differentially private synthetic data generation algorithms, CoRR abs/2112.09238 (2021). URL: https://arxiv.org/abs/2112.09238.
[38] A. P. Bradley, The use of the area under the roc curve in the evaluation of machine learning algorithms, Pattern recognition 30 (1997) 1145-1159.
[39] V. M. Suriyakumar, N. Papernot, A. Goldenberg, M. Ghassemi, Chasing your long tails: Differentially private prediction in health care settings, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 723-734.
[40] M. L. Fang, D. S. Dhami, K. Kersting, Dp-ctgan: Differentially private medical data generation using ctgans, in: Artificial Intelligence in Medicine: 20th International Conference on Artificial Intelligence in Medicine, AIME 2022, Halifax, NS, Canada, June 14-17, 2022, Proceedings, Springer, 2022, pp. 178-188.
[41] L. Xu, M. Skoularidou, A. Cuesta-Infante, K. Veeramachaneni, Modeling tabular data using conditional gan, Advances in Neural Information Processing Systems 32 (2019).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>