<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Empirical evaluation of amplifying privacy by subsampling for GANs to create differentially private synthetic tabular data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valtteri A. Nieminen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tapio Pahikkala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antti Airola</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Turku, Department of Computing</institution>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Privacy concerns often limit the sharing of sensitive data collected from individuals. One proposed solution to make secondary use possible is privacy-preserving synthetic data that attempts to mimic real data. Due to their success on non-private tasks, GANs trained with differentially private stochastic gradient descent (DPSGD) have been popular for generating DP synthetic data. In recent years, a prominent approach to achieving better privacy guarantees has been to train ensembles of discriminator networks with DPSGD on mutually exclusive subsets, taking advantage of the synergy between GANs and privacy amplification by subsampling. However, this research has been done almost exclusively on images, and empirical evaluations of this strategy on other types of data are lacking. This work focuses on the effects of subsampling in creating DP synthetic tabular data with GANs. We evaluate synthetic data utility by training classification models on synthetic data and testing them on real data at varying subsampling rates. Further, we complement the evaluation with a qualitative examination of the generated data. Our findings show that while subsampling does bring benefits with tabular data in terms of the prediction performance of classifiers trained on synthetic data, the resulting samples can be very distorted compared to the original real data. The results suggest that the benefits obtainable via this method of training DP GANs can differ significantly based on the type of data used.</p>
      </abstract>
      <kwd-group>
<kwd>Machine Learning</kwd>
        <kwd>Differential Privacy</kwd>
        <kwd>GAN</kwd>
        <kwd>Synthetic Data</kwd>
        <kwd>Privacy Amplification by Subsampling</kwd>
        <kwd>Tabular Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Whether or not the calculation is perturbed, there is a tradeoff [5, 7] between privacy and utility. How good DP synthetic data is comes down to optimizing this tradeoff.</p>
      <p>Due to their success in non-private settings, generative adversarial networks (GAN) (see e.g. [8]), trained privately, have been among the most utilized model types to create DP synthetic data. A GAN comprises two types of neural networks, a generator G and a discriminator D, both initialized simultaneously and trained in tandem in an adversarial setup [9]. The goal is that G learns to produce samples similar to those of the real data distribution based on the feedback of D. This feedback concerns how well D can discern whether a sample was synthetic or came from the real data distribution, based on the approximation of the real data distribution that D has learned during training.</p>
      <p>DPSGD [6] was groundbreaking because it enabled the training of deep ML models with meaningful privacy guarantees. At the heart of why DPSGD made this possible is that it takes advantage of privacy amplification by subsampling (PABS) [6, 10, 5], a well-known result vital for modern privacy-preserving algorithms. PABS describes how privacy is amplified when an algorithm is run only on a subset of the whole data. In DPSGD, this result is applied to mini-batching, seeing each batch as a subset. Intuitively, amplification results from an adversary being unable to know which points were chosen for which iteration of the algorithm. The resulting improvement to privacy is very significant even for datasets of modest size and can, depending on the sampling strategy, be roughly as large as m/n, where m stands for the size of the subset and n for the size of the whole training data.</p>
      <p>While the first works on DP GANs took DPSGD and more or less simply combined it with a GAN to make the information on the sensitive, real data flowing to the discriminator private (see e.g. [11, 12]), soon methods taking advantage of the specifics of GANs were introduced. PATE-GAN [13] took the PATE [14] mechanism, a version of the subsample-and-aggregate framework of DP [5, 15], and used it to train a GAN model using the aggregated votes of an ensemble of multiple discriminators. Long et al. [16] took this approach further with G-PATE and noted that only the sensitive information flowing to the generator needs to be sanitized, as only the generator is released (after the network is trained, discriminators are not needed for generating synthetic data). This allowed for training a large ensemble of discriminators on exclusive subsets of data, taking advantage of the synergy between PABS and the GAN setup, where the discriminator and generator networks are separated.</p>
      <p>Unlike in G-PATE, where the information D provides to G is discretized to votes, the work on GS-WGAN by Chen et al. [17] worked the subsample-and-aggregate idea into DPSGD, improving on the results of previous works by aggregating the gradients of large discriminator ensembles (&gt; 1000) to train the generator. This application of subsample-and-aggregate [5, 15] to DPSGD results in only the updates to the G network costing privacy, while D can be trained without privacy costs, while also reaping the benefits of PABS by training multiple discriminators. This opens up many possibilities for further optimization: for example, the D networks can be pre-trained before updates to G, and G can be trained for more iterations. However, further research on this method has been lacking, with some claiming that the large number of networks trained hinders practical usability due to the time and resources taken to train so many networks [18].</p>
      <p>Most DP GAN synthetic data generation research has been conducted on image data. However, the overwhelming majority of, for example, health data is in tabular form. Here, tabular data refers to data where observations are on rows and columns represent the features of those observations, which may or may not be of mixed type. At the time, Chen et al. achieved state-of-the-art results using the handwritten digits MNIST [19] and Fashion-MNIST [20] image datasets. The absence of empirical results on tabular data leaves open interesting questions on data modality and the usability of this method to train DP GANs. As said, the approach requires dividing the data into many mutually exclusive subsets. Unlike tabular data, where features may or may not be correlated or important for some task at hand, the features of images are autocorrelated, as they depict parts of a whole.</p>
      <p>This paper presents an empirical investigation into generating DP tabular synthetic data using a GAN trained with the subsampled DPSGD strategy presented by Chen et al. [17]. We conduct experiments with tabular data using the freely available Cardio [21] dataset. The experiments include a standard downstream classification utility task in which classifiers are trained on synthetic and tested on real data. Unlike previous works, we focus on the effect of this training strategy in particular by varying the number of discriminators trained with mutually exclusive subsets across the experiments. The downstream classification utility experiment is augmented with a qualitative examination of the structure of the generated synthetic data. To the authors' knowledge, this work is the first to present an evaluation that focuses on subsampled DPSGD training of a DP GAN with tabular data rather than images.</p>
    </sec>
    <sec id="sec-p">
      <title>2. Preliminaries</title>
      <sec id="sec-p-1">
        <title>2.1. Differential privacy</title>
        <p>A randomized algorithm ℳ is (ε, δ)-differentially private if for adjacent datasets D1 and D2, meaning they differ in at most one record, and for all measurable sets S of outputs, the following inequality holds:</p>
        <p>Pr[ℳ(D1) ∈ S] ≤ e^ε · Pr[ℳ(D2) ∈ S] + δ. (1)</p>
<p>Here ℳ is, for example, one iteration of DPSGD training, ε is the upper bound for the privacy loss, and δ is a small probability of a catastrophic breach of the DP guarantee [5]. A smaller ε stands for a stronger guarantee.</p>
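        <p>To make inequality (1) concrete, the following is a minimal, illustrative Python sketch (ours, not part of the paper) that empirically checks the bound for the classic randomized-response mechanism, whose privacy loss with a fair coin is known to be ε = ln(3) with δ = 0:</p>
        <preformat>import math
import random

def randomized_response(true_bit: int) -&gt; int:
    # With probability 1/2 report the truth, otherwise report a fair coin flip.
    if random.random() &lt; 0.5:
        return true_bit
    return random.randint(0, 1)

def estimate_output_prob(true_bit: int, trials: int = 200_000) -&gt; float:
    hits = sum(randomized_response(true_bit) for _ in range(trials))
    return hits / trials

# Adjacent "datasets": a single record that differs (1 vs. 0).
p1 = estimate_output_prob(1)   # Pr[M(D1) = 1], about 0.75
p2 = estimate_output_prob(0)   # Pr[M(D2) = 1], about 0.25
eps = math.log(3)              # known privacy loss of this mechanism
print(p1 &lt;= math.exp(eps) * p2)  # inequality (1) with delta = 0</preformat>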
<p>Although a single acceptable value for the privacy budget ε cannot be given, as it depends on the context, in the literature values of ε ≤ 1 have been seen as very strong protection [22], and, depending on the type of data and task, values of ε ≤ 10 have been seen to still result in meaningful guarantees [6]. Informally, ε can be said to depict the worst case of how much information that cannot be learned from other individuals' data can be learned from the output concerning a specific individual.</p>
<p>The model in this work uses Rényi differential privacy (RDP) [23], another formulation of DP often used with DP deep learning models to get tighter bounds when composing DP guarantees over iterations. In this paper, due to interpretability, Rényi DP bounds are converted to (ε, δ). The privacy loss of training is tracked via the subsampled Rényi moments accountant [24]. There exist many ways to compose the privacy costs of sequential runs of a DP algorithm. Naively, this is a summation, but by using advanced techniques, a more efficient composition can be achieved.</p>
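        <p>As an illustration of the RDP-to-(ε, δ) conversion mentioned above, the sketch below applies the standard conversion from [23], ε = ε_RDP(α) + log(1/δ)/(α − 1), minimized over a grid of orders α. The Gaussian-mechanism RDP curve ε_RDP(α) = α/(2σ²) is used here as a simple stand-in for the output of the subsampled accountant [24], which is tighter in practice:</p>
        <preformat>import math

def rdp_gaussian(alpha: float, sigma: float) -&gt; float:
    # RDP curve of the (non-subsampled) Gaussian mechanism [23]: alpha / (2 sigma^2).
    return alpha / (2.0 * sigma ** 2)

def rdp_to_eps_delta(sigma: float, delta: float, steps: int) -&gt; float:
    # RDP composes additively over iterations; then convert to (eps, delta) with
    # eps = rdp(alpha) + log(1/delta) / (alpha - 1), minimized over alpha.
    best = float("inf")
    for alpha in [1.25, 1.5, 2, 4, 8, 16, 32, 64, 128]:
        total_rdp = steps * rdp_gaussian(alpha, sigma)
        eps = total_rdp + math.log(1.0 / delta) / (alpha - 1.0)
        best = min(best, eps)
    return best

print(rdp_to_eps_delta(sigma=5.0, delta=1e-5, steps=1000))</preformat>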
<p>This work uses differentially private stochastic gradient descent (DPSGD) [6] to optimize the DP GAN. DPSGD differs from its non-private counterpart in that, prior to updating the model parameters, the maximum influence an individual data point can have on the output, called the sensitivity [5] of the function, is bounded by clipping gradients. Clipping is followed by adding noise from a noise mechanism. Noise mechanisms, like the Gaussian mechanism used in this work, are functions from which noise calibrated to a specific sensitivity can be sampled [25]. The choice of noise mechanism largely depends on the type of information sanitized.</p>
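      <p>The classic calibration of the Gaussian mechanism gives a concrete example of fitting noise to a sensitivity bound: for a function with ℓ2-sensitivity Δ, noise with standard deviation σ = Δ·sqrt(2 ln(1.25/δ))/ε satisfies (ε, δ)-DP for ε below 1 (see e.g. [5]). A hypothetical sketch:</p>
      <preformat>import math
import random

def gaussian_mechanism(value: float, sensitivity: float, eps: float, delta: float) -&gt; float:
    # Classic calibration for the Gaussian mechanism (valid for eps &lt; 1), see [5]:
    # sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / eps.
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps
    return value + random.gauss(0.0, sigma)

# Example: privatize a mean of values clipped to [0, 1] over n = 1000 records,
# so the sensitivity of the mean is 1/n.
print(gaussian_mechanism(0.42, sensitivity=1.0 / 1000, eps=0.5, delta=1e-5))</preformat>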
      <p>A DPSGD training step, that is, one run of the DP optimization algorithm, can be summarized as follows:</p>
      <p>1. Gradients before sanitation are calculated with backpropagation as in non-DP SGD. At training step t these are ∇ℒ_t = ∇_θ ℒ(θ_t^D, θ_t^G), where θ_t^D and θ_t^G are the discriminator and generator networks' weights.</p>
      <p>2. Gradient information is sanitized by bounding the sensitivity, clipping the gradient vectors to a maximum norm of C, and adding noise: ∇̂ℒ_t = ℳ_{σ,C}(∇ℒ_t) = clip(∇ℒ_t, C) + 𝒩(0, σ²C²).</p>
      <p>3. The parameters of the model are updated using the sanitized gradients as in normal gradient descent: θ^(t+1) := θ^(t) − η · ∇̂ℒ_t, where η is the learning rate.</p>
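      <p>The three steps above can be condensed into an illustrative PyTorch sketch. This is our own naive reconstruction, not the GS-WGAN code; a practical implementation would use dedicated per-sample gradient machinery (e.g. the Opacus library) for efficiency:</p>
      <preformat>import torch

def dpsgd_step(model, loss_fn, batch_x, batch_y, lr=0.1, C=1.0, sigma=1.0):
    """One illustrative DPSGD step: clip per-example gradients, add noise, update."""
    params = list(model.parameters())
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(batch_x, batch_y):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()  # step 1: per-example gradient via backpropagation
        # Step 2a: clip this example's gradient to l2-norm at most C.
        norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
        scale = min(1.0, C / (norm.item() + 1e-12))
        for s, p in zip(summed, params):
            s += p.grad * scale
    n = len(batch_x)
    with torch.no_grad():
        for s, p in zip(summed, params):
            # Step 2b: add Gaussian noise N(0, sigma^2 C^2), then average.
            noisy = (s + sigma * C * torch.randn_like(s)) / n
            # Step 3: standard gradient-descent update with the sanitized gradient.
            p -= lr * noisy</preformat>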
        <sec id="sec-1-1-1">
          <title>2.2. DP synthetic data</title>
<p>DP synthetic data generation is possible due to the post-processing property [5] of differential privacy, which guarantees that the outputs, in this case synthetic data, of any process that is DP are also DP. Importantly, the DP guarantees are not actually over the synthetic data but over the algorithm that generated it.</p>
<p>Evaluation of DP synthetic data can be said [1] to lie on three axes: privacy, utility, and fidelity (also sometimes called sample quality). Utility is simply the usefulness of the synthetic data for a given task. In this work, the downstream classification task, where a downstream model, a model trained on synthetic data, is evaluated on real data, is concerned with utility. Fidelity refers to how closely the statistical properties of the synthetic data are preserved. What exactly this means differs by the measure used, but often, for example, correlational structure and distributions are compared between synthetic and real data.</p>
      </sec>
      <sec id="sec-p-3">
        <title>2.3. GAN</title>
        <p>Generative Adversarial Networks [9] are a type of generative model where training is formulated as a competitive game between two networks: a generator G and a discriminator D. The goal is that G learns a mapping from some bounded domain, usually a noise vector denoted with z, to an approximation of the distribution of real data based on the D-network's feedback.</p>
        <p>The G can be used to generate samples from the distribution it has learned by feeding z to the network as input: G(z). D discriminates between this generated output G(z) and real data. GANs have been thought to have some inherent privacy attributes, such as resistance to overfitting [8], because G only interacts with the real data indirectly by receiving information from D, and it does not define an explicit density function but learns an approximation during training.</p>
        <p>Few works on these inherent properties exist, but even non-DP GANs have been shown to provide some weak protection against membership inference attacks [26].</p>
        <p>The model used in this work is based on a Wasserstein GAN with gradient penalty (WGAN-GP) [27]. In the context of Wasserstein GAN [27], the discriminator is called a critic, but in this work, it is referred to as a discriminator as well to avoid extra terminology.</p>
      <sec id="sec-1-5">
        <title>The choice of the Wasserstein loss and use of gradient</title>
        <p>penalty [27] is non-trivial as it has privacy-synergies
with DPSGD clipping [17]. The Wasserstein loss is based
on the Earth Mover’s distance (EMD). For EMD to be
valid, the 1-Lipschitz continuity condition must hold (see
Definition 1).</p>
        <p>R
Definition 1 (Lipschitz-continuity). A function f :
R → R is globally L-Lipschitz continuous if there exists
an L ≥ 0 such that ‖ () −  ()‖ ≤ ‖ − ‖ ∀,  ∈</p>
        <p>If the continuity holds, gradient magnitudes during
training are approximately between [− 1, 1] [17]. The
gradient penalty regularization term [27] is used as a soft
restraint to make the condition hold. This is beneficial
for DPSGD training, as then setting the clipping bound
 = 1 should be a close to optimal choice and a costly
search for the hyperparameter value is avoided [17].
Definition 2 (Wasserstein-1 loss of D and G [27]).
ℒ = − E∼ ^ [()] + E˜∼  [(˜)] + GP
ℒ = − E∼  [(())]</p>
        <p>Where  is the noise sampled from a normal
distribution given as input to G to generate samples, 
is the regularization strength hyperparameter of the
gradient-penalty term,  is the real data and ˜ is the
data generated by G. GP is the gradient penalty term
 E ︀[ (‖∇(  + (1 −  )˜)‖2 − 1)2]︀ and  ∼  [0, 1]
is the interpolation coeficient and ∇ [27].</p>
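        <p>Definition 2 translates directly into code. The sketch below is a minimal, illustrative PyTorch version of the two losses, assuming 2-D (batch × features) inputs; the names are ours, not those of the paper's implementation:</p>
        <preformat>import torch

def d_loss_wgan_gp(D, real, fake, lam=10.0):
    """Wasserstein-1 discriminator loss with gradient penalty (Definition 2)."""
    # Interpolate between real and generated samples with alpha ~ U[0, 1].
    alpha = torch.rand(real.size(0), 1)
    interp = (alpha * real + (1.0 - alpha) * fake).requires_grad_(True)
    d_interp = D(interp)
    grads, = torch.autograd.grad(
        outputs=d_interp, inputs=interp,
        grad_outputs=torch.ones_like(d_interp), create_graph=True)
    gp = ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
    return -D(real).mean() + D(fake).mean() + lam * gp

def g_loss_wgan(D, G, z):
    """Generator loss: L_G = -E[D(G(z))]."""
    return -D(G(z)).mean()</preformat>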
        <sec id="sec-1-5-1">
          <title>2.4. Subsample-and-aggregate and privacy amplification</title>
          <p>The work of Chen et al., [17] can be seen to be a
successor to a line of works, especially the G-PATE [16], which
adapts the subsample-and-aggregate [5] framework of
DP, first formalized by Nissim et al. [ 15], to DPSGD and
training of multiple discriminators on a GAN setup to
reap privacy amplification by subsampling (PABS)
beneifts.</p>
        </sec>
      </sec>
      <sec id="sec-1-6">
        <title>Privacy amplification by subsampling is a well-studied</title>
        <p>subject with bounds for diferent sampling strategies,
such as without replacement or with replacement having
been worked out extensively, especially in the works of
Kasiviswanathan et al., [28] and Balle et al., [10]. PABS
is induced in the model of this work by training a large
number of  networks on mutually exclusive subsets
and randomly querying them at each  update step. This
corresponds to PABS for sampling without replacement
[10], with an amplification efect roughly proportional to
 , where  is the number of mutually exclusive subsets

the data is split into and  the size of the whole training
data.</p>
        <p>Figure 1 depicts the sanitation of gradients [17] during
the update steps of the generator. As seen from the figure,
where the sanitation bound, or "privacy barrier" as called
by [17] is placed "between" the two networks. This is an
important emphasis, because, it is what allows training
the  networks without incurring privacy costs. If the
sanitation would be between, for example, the real data
and the  networks, every time they see real data would
result in a privacy cost.</p>
      </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>3. Materials and methods</title>
      <sec id="sec-2-1">
        <title>3.1. Model specifications</title>
<p>The freely available code [29] of GS-WGAN by Chen et al. [17] was used as the basis of the implementation, but the architecture and gradient clipping procedure code were re-implemented to fit the tabular data case, replacing the convolutional architecture with a fully connected one using PyTorch (v. 1.4.0) [30]. Choices for the new architecture specifications, such as the width and depth of the network and the number of D training repetitions per generator iteration, were made based on a limited number of tests with fewer than five training runs with different seeds per choice. Unless mentioned here, hyperparameter choices were those recommended by [27].</p>
<p>The G network used was a fully connected network with two hidden layers, the largest being of size 256, and 16 outputs. The size of the noise vector z was set to 32, based on experimenting with a few usual settings. The activation function used was ReLU [31], except in the last layer of G, where a hyperbolic tangent (TanH) was used, due to the range of the feature values and of the Wasserstein loss function (both [-1, 1]). The D classifier network was a typical multilayer perceptron architecture with one hidden layer of size 128. Instead of ReLU, as in the G network, a LeakyReLU [32] with the negative-slope value set to 0.2 was used in the hidden layers, as in [27].</p>
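        <p>A minimal PyTorch sketch of fully connected networks matching the stated specifications; any layer size not quoted above (here, the generator's first hidden layer) is our assumption:</p>
        <preformat>import torch.nn as nn

# Generator: two hidden layers (largest 256), noise dim 32, 16 outputs,
# ReLU activations and TanH output to match the [-1, 1] feature range.
# The size of the first hidden layer (128) is our guess.
generator = nn.Sequential(
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 16), nn.Tanh(),
)

# Discriminator (critic): one hidden layer of size 128 with LeakyReLU(0.2);
# a single linear output, as is usual for Wasserstein critics.
discriminator = nn.Sequential(
    nn.Linear(16, 128), nn.LeakyReLU(0.2),
    nn.Linear(128, 1),
)</preformat>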
<p>The publicly available Cardiovascular Disease dataset [21] consists of 70 000 observations and 12 features: five binary, two categorical, and five numeric. The classification target is to predict the presence of cardiovascular disease (the feature 'cardio'). Table 1 lists all features and their types. This dataset was chosen for reproducibility and to work as a feasible proxy for common tabular EHR data; the condition to predict is common, and the features are routinely collected during doctor's examinations. The number of patients with cardiovascular disease in the dataset is balanced. Blood pressure values were limited to a range of ±20 from values indicating a hypertensive crisis according to Finnish national standards [33], affecting 1064 values of ap_hi and 312 values of ap_lo. KNN imputation (see [34]) with k = 3 was carried out to replace these, using the implementation from the scikit-learn package [35] (version 1.0.2). The two categorical variables 'cholesterol' and 'glucose' were one-hot encoded, after which all features were min-maxed to [-1, 1] to match the feature value range of the G network output layer.</p>
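        <p>The preprocessing described above can be sketched with scikit-learn roughly as follows. The column names follow the Kaggle dataset; the out-of-range bounds are illustrative placeholders, as the paper derives them from hypertensive-crisis values ±20 [33]:</p>
        <preformat>import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("cardio_train.csv", sep=";")

# Mask blood-pressure values outside the accepted range (bounds here are
# hypothetical; the paper's are based on Finnish national standards [33]).
for col, lo, hi in [("ap_hi", 60, 200), ("ap_lo", 40, 140)]:
    df.loc[(df[col] &lt; lo) | (df[col] &gt; hi), col] = np.nan

# KNN imputation with k = 3, as in the paper.
df[df.columns] = KNNImputer(n_neighbors=3).fit_transform(df)

# One-hot encode the two categorical features ...
df = pd.get_dummies(df, columns=["cholesterol", "gluc"])

# ... and min-max all features to [-1, 1] to match the G output layer.
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(df)</preformat>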
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experiments and evaluation</title>
<p>The quality of the synthetic data is evaluated from two viewpoints: downstream classification utility and sample fidelity. This section gives an overview of the downstream utility experiment conducted and the DP synthetic data generation process. The downstream classification utility experiment is depicted in Algorithm 1.</p>
      <sec id="sec-3-1">
        <title>4.1. Downstream utility experiment</title>
<p>Downstream classification utility is a standard way of evaluating DP synthetic data and the method used to generate it (see e.g. [36, 17, 37]). In this experiment, synthetic data is used to train a downstream model, which is tested against real data. In this work's binary classification task, a logistic regression (LR) classifier from the scikit-learn [35] (version 1.0.2) package was used as the model of choice, and classification performance was measured with the AUC metric [38].</p>
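        <p>The core of this train-on-synthetic, test-on-real evaluation amounts to a few lines; the sketch below is our simplification, while the actual experiment adds checkpointing and hyperparameter search (see Algorithm 1):</p>
        <preformat>from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def downstream_auc(X_syn, y_syn, X_real_test, y_real_test):
    """Train a classifier on synthetic data, score it with AUC on real data."""
    clf = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
    scores = clf.predict_proba(X_real_test)[:, 1]
    return roc_auc_score(y_real_test, scores)</preformat>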
<p>Five private generators were trained for the downstream classification task, each up to a maximum of 40 000 iterations, using a number of pre-trained D networks corresponding to the subsampling rates 1/250, 1/500, 1/750, 1/1000, and 1/1500, that is, the fraction of real data in each mutually exclusive subset. In addition, a non-private G network was trained to compare the effects of the generating process only. In all cases, the discriminators D were pre-trained for 2000 iterations.</p>
<p>Every model was saved once per 1000 iterations, referred to in this work as checkpoints. Each of these saved states of the model was evaluated separately. Hyperparameter optimization over the choice of regularization term and regularization strength was conducted for the logistic regression classifier separately for each checkpoint at the different iterations. The real and synthetic data used for this experiment were split into training, validation, and test sets with sizes corresponding to fractions (0.8/0.1/0.1) of the real dataset. The resulting set sizes were 56 000 for the synthetic training set, 7000 for the synthetic validation set, and 7000 for the real test set.</p>
<p>A total of 287 full model selection runs (40 checkpoints at each of the 6 different subsampling settings and the additional real-versus-real baseline case), consisting of hyperparameter optimization and evaluation with the best hyperparameter settings, were conducted.</p>
</sec>
      <sec id="sec-3-2">
        <title>4.2. Sample fidelity</title>
        <p>A comparative assessment of the effects of different privacy and subsampling levels on the method's ability to retain sample fidelity consists in this work of three comparisons: correlational structure using Spearman's rank correlation coefficient, a visual examination of the change in the continuous feature distributions, and a visual examination of the change in the binary and categorical variable distributions.</p>
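        <p>A compact sketch of the correlational-structure comparison (our own helper, not the paper's code):</p>
        <preformat>import numpy as np
from scipy.stats import spearmanr

def correlation_gap(real: np.ndarray, synthetic: np.ndarray):
    """Compare Spearman rank-correlation matrices of the continuous features."""
    rho_real, p_real = spearmanr(real)        # correlates columns pairwise
    rho_syn, p_syn = spearmanr(synthetic)
    # Largest absolute change in any pairwise correlation, plus both matrices.
    return np.abs(rho_real - rho_syn).max(), (rho_real, rho_syn)</preformat>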
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Results</title>
      <sec id="sec-4-1">
        <title>5.1. Downstream classification utility</title>
<p>Figure 2 compares the results of the downstream classification utility experiment (train on synthetic, test on real data) across the subsampling rates, with the private models compared to a non-private generator denoted NO_DP.</p>
        <p>Algorithm 1: Downstream Classification Quality Experiment
1 Create sets h ∈ H, where H denotes combinations between 20 values of the regularization strength randomly sampled from a logarithmic space between (0, 1) and a regularization term choice of either L1 or L2
2 for subsampling rate r in {1, 1/250, 1/500, 1/750, 1/1000, 1/1500} do
3   pretrain k (the denominator of r) networks D
4   train G using D; save "checkpoint models" G1, G2, ..., G40 every 1000 iterations
5 for r in {1, 1/250, 1/500, 1/750, 1/1000, 1/1500} do
6   for i in {1k, 2k, ..., 40k} do
7     split dataset X of size n with stratification by the Y variable 'cardio' into Xtrain, Xval, and Xtest with proportions 0.8/0.1/0.1 of n
8     sample s = 0.9 · n synthetic data points from Gi and split with stratification into Strain and Sval
9     for h ∈ H do
10      train classifier C with hyperparameters h and data Strain
11      evaluate C against Sval
12      save best h in hbest
13    train classifier C with hbest and data Strain combined with Sval
14    evaluate C using Xtest
15    save the result of the downstream classification utility test for Gi
16    empty the set hbest</p>
        <p>[Figure 2: Downstream classification utility (AUC) at subsampling rates up to 1/1500, compared to a non-private generator denoted NO_DP; the AUC with real data (0.795) is shown for reference. The checkpoint with the best of the evaluations for a specific model is chosen as the "best model". The visualizations of distributions in Figure 3 are from data generated by these "best" models.]</p>
        <p>The best overall results in terms of the ε-AUC tradeoff were reached by M1/1500, compared here with data sampled from M1/1000, which ultimately failed to reach the same AUC, its highest score being 0.687 at ε = 14.7.</p>
        <p>In general, models with weaker privacy guarantees and a smaller subsampling rate were able to reach higher values of AUC eventually, but at high privacy costs. In comparison to M1/1500, which reached the best tradeoff, M1/500, for example, reached the value 0.717, close to that of M1/1500, at ε = 34.8, nearly six times more. The best AUC value obtained with the privacy-preserving models was reached by M1/250 at an AUC of 0.752 with ε = 63.0.</p>
</sec>
      <sec id="sec-4-2">
        <title>5.2. Sample fidelity</title>
        <p>Figure 3 shows distributions of the continuous features generated by the models performing best in the downstream classification task at settings M1/250, M1/750, and M1/1500, as well as by the non-private NO_DP, all compared with the real data feature distributions. Note that the y-axis density value range varies to provide better resolution for each variable. Interestingly, there is a visible x-axis shift, especially in the samples from models where the number of subsets is larger.</p>
        <p>[Figure 3: Density plots of the continuous features for real data and for data sampled from the models 1/1500, 1/1000, 1/750, 1/500, 1/250, and NO_DP.]</p>
        <p>Figure 4 compares the binary and categorical feature distributions of data sampled from the DP models, the baseline model NO_DP, and real data. NO_DP appears to capture the distributions of the real data well, but when DP is applied there are considerable deviations from the real data case, especially with M1/1500 with stricter guarantees (ε = 6). In the case of features where the number of positive cases is low to begin with, such as 'alcohol', adding DP seems to often further decrease the amount of positives. For the categorical features 'cholesterol' and 'glucose', stronger privacy guarantees, such as in the case of M1/1500, seem to also balance the size differences between the counts.</p>
        <p>[Figure 4: Proportions of counts of the binary and categorical features Smoking, Alcohol, Active, Cardio, Cholesterol, and Glucose for real data and for data sampled from the models 1/1500, 1/1000, 1/750, 1/500, 1/250, and NO_DP.]</p>
        <p>Figure 5 shows a comparison of Spearman rank correlation coefficient values calculated between the continuous features across synthetic data sampled from the best-performing models. Significant correlations are marked with (*) for a significance level of p &lt; 0.05 and (**) for p &lt; 0.01. Even in the case of the non-private synthetic data sampled from NO_DP, many of the dependencies in the real data are lost, as is the case with, for example, the correlation between 'weight' and 'ap_hi'. In addition, the synthetic datasets, especially those sampled from the private models, show new correlations that are not present in the real data.</p>
        <p>[Figure 5: Spearman rank correlation coefficients of the continuous variables (Age, Height, Weight, Ap_lo, Ap_hi) compared between the synthetic and real datasets across the best-performing models in the downstream classification utility experiment (Mbaseline, AUC = 0.788; M1/250, ε = 63, AUC = 0.752; M1/750, ε = 18, AUC = 0.751; M1/1500, ε = 6, AUC = 0.717). The synthetic data generated by the non-private model as well as the real data case are included for comparison. One asterisk (*) denotes significance at a p-value &lt; 0.05 and (**) marks significance at level &lt; 0.01.]</p>
</sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion</title>
      <p>In the downstream classification utility task (see Figure 2), the private models, especially those with higher subsampling rates, required more training iterations to obtain high AUC values. However, the benefits to privacy attained with the subsample-and-aggregate DPSGD strategy outweighed these costs. This result suggests that the benefits seen with PABS and image data by Chen et al. [17] can also be reaped in terms of downstream tasks when tabular data with mixed features is used. However, the samples generated by models at stricter privacy guarantees have substantial deviations in all the fidelity examinations when compared to real data. This is especially evident in the binary feature distributions when there are few positive observations to begin with, which corresponds with other results showing that DPSGD training affects imbalanced distributions disproportionately [39]. In addition, some correlations turn from non-significant to significant.</p>
      <p>The described effect worsens as the subsampling rate rises, but it does not affect the downstream classification metric nearly as much. This could be indicative of the differences between tabular and image data hypothesized earlier: tabular data features may or may not be correlated or important for the task at hand, while the features of images are autocorrelated, as they depict parts of a whole. Tabular data may lose almost all signal in some features, while with images, the perturbation is applied more evenly due to autocorrelation and less imbalance in the distributions.</p>
      <p>A clear limitation of this paper is that it only uses one dataset. This is due to the significant computational expense: the time it took to train a whole model exceeded 15 hours for the model with the highest subsampling rate 1/1500, not counting the hyperparameter optimization, using three NVIDIA RTX Titan GPUs. Subsample-and-aggregate-based methods have been criticized for being computationally expensive [18]. However, this question is not as black and white because, although expensive, subsampling could incur great privacy benefits for some types of data while having only a small detrimental effect on model performance.</p>
      <p>Compared with other works using the same data, Fang et al. [40] reported better results. However, it has since been noted that their approach of adding DP to the conditional GAN of [41] is not DP, since it oversamples the data and the DP mechanism is not random. RDP-CGAN by Torfi et al. [36] reported results visually in a figure of approximately AUC = 0.72 at ε = 10, which falls short of the results of this work.</p>
      <p>The results of this study suggest that subsample-and-aggregate DPSGD training also brings benefits with tabular data, however, with a higher cost to fidelity than with images. From a broader perspective, this work adds to the line of thought [1] that useful DP synthetic data can be made specifically for some problems, but making "general" synthetic data, where all features would be preserved well and which could be used like real data, is very difficult if not impossible.</p>
      <p>Future work could take a closer look at how the benefits from subsampling and data structure relate, using, for example, simulated data to control more parameters. Another direction relates to taking advantage of the free training of the discriminators. For example, ways to track when generator training steps would be optimal with this method could result in significant benefits.</p>
    </sec>
    <sec id="sec-6">
      <title>7. Acknowledgements</title>
      <p>This work has been conducted as part of the PRIVASA project funded by Business Finland (grant number 37428/31/2020).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[1] J. Jordon, L. Szpruch, F. Houssiau, M. Bottarelli, G. Cherubin, C. Maple, S. N. Cohen, A. Weller, Synthetic data - what, why and how?, 2022. URL: https://arxiv.org/abs/2205.03257. doi:10.48550/ARXIV.2205.03257.
[2] A. Wood, M. Altman, A. Bembenek, M. Bun, M. Gaboardi, J. Honaker, K. Nissim, D. R. O'Brien, T. Steinke, S. Vadhan, Differential privacy: A primer for a non-technical audience, Vand. J. Ent. &amp; Tech. L. 21 (2018) 209.
[3] J. Hayes, L. Melis, G. Danezis, E. De Cristofaro, Logan: Membership inference attacks against generative models, arXiv preprint arXiv:1705.07663 (2017).
[4] D. Chen, N. Yu, Y. Zhang, M. Fritz, Gan-leaks: A taxonomy of membership inference attacks against generative models, in: Proceedings of the 2020 ACM SIGSAC conference on computer and communications security, 2020, pp. 343-362.
[5] C. Dwork, A. Roth, et al., The algorithmic foundations of differential privacy, Foundations and Trends in Theoretical Computer Science 9 (2014) 211-407.
[6] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, L. Zhang, Deep learning with differential privacy, in: Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, 2016, pp. 308-318.
[7] T. Stadler, B. Oprisanu, C. Troncoso, Synthetic data - anonymisation groundhog day, in: 31st USENIX Security Symposium (USENIX Security 22), USENIX Association, Boston, MA, 2022, pp. 1451-1468. URL: https://www.usenix.org/conference/usenixsecurity22/presentation/stadler.
[8] I. Goodfellow, Nips 2016 tutorial: Generative adversarial networks, arXiv preprint arXiv:1701.00160 (2016).
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Advances in neural information processing systems 27 (2014).
[10] B. Balle, G. Barthe, M. Gaboardi, Privacy amplification by subsampling: Tight analyses via couplings and divergences, Advances in Neural Information Processing Systems 31 (2018).
[11] L. Xie, K. Lin, S. Wang, F. Wang, J. Zhou, Differentially private generative adversarial network, arXiv preprint arXiv:1802.06739 (2018).
[12] B. K. Beaulieu-Jones, Z. S. Wu, C. Williams, R. Lee, S. P. Bhavnani, J. B. Byrd, C. S. Greene, Privacy-preserving generative deep neural networks support clinical data sharing, Circulation: Cardiovascular Quality and Outcomes 12 (2019).
[13] J. Jordon, J. Yoon, M. Van Der Schaar, Pate-gan: Generating synthetic data with differential privacy guarantees, in: International conference on learning representations, 2019.
[14] N. Papernot, S. Song, I. Mironov, A. Raghunathan, K. Talwar, Ú. Erlingsson, Scalable private learning with PATE, arXiv preprint arXiv:1802.08908 (2018).
[15] K. Nissim, S. Raskhodnikova, A. Smith, Smooth sensitivity and sampling in private data analysis, in: Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, 2007, pp. 75-84.
[16] Y. Long, S. Lin, Z. Yang, C. A. Gunter, B. Li, Scalable differentially private generative student model via PATE, arXiv preprint arXiv:1906.09338 (2019).
[17] D. Chen, T. Orekondy, M. Fritz, GS-WGAN: A gradient-sanitized approach for learning differentially private generators, Advances in Neural Information Processing Systems 33 (2020) 12673-12684.
[18] T. Cao, A. Bie, A. Vahdat, S. Fidler, K. Kreis, Don't generate me: Training differentially private generative models with sinkhorn divergence, Advances in Neural Information Processing Systems 34 (2021) 12480-12492.
[19] L. Deng, The MNIST database of handwritten digit images for machine learning research, IEEE Signal Processing Magazine 29 (2012) 141-142.
[20] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747 (2017).
[21] S. Ulianova, Cardiovascular disease dataset, 2019. URL: https://www.kaggle.com/sulianova/.
[22] C. Arnold, M. Neunhoefer, Really useful synthetic data - A framework to evaluate the quality of differentially private synthetic data, arXiv preprint arXiv:2004.07740 (2020).
[23] I. Mironov, Rényi differential privacy, in: 2017 IEEE 30th computer security foundations symposium (CSF), IEEE, 2017, pp. 263-275.
[24] Y.-X. Wang, B. Balle, S. P. Kasiviswanathan, Subsampled Rényi differential privacy and analytical moments accountant, in: The 22nd International Conference on Artificial Intelligence and Statistics, PMLR, 2019, pp. 1226-1235.
[25] C. Dwork, F. McSherry, K. Nissim, A. Smith, Calibrating noise to sensitivity in private data analysis, in: Theory of cryptography conference, Springer, 2006, pp. 265-284.
[26] Z. Lin, V. Sekar, G. Fanti, On the privacy properties of gan-generated samples, in: International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 1522-1530.
[27] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. C. Courville, Improved training of Wasserstein GANs, Advances in neural information processing systems 30 (2017).
[28] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, A. Smith, What can we learn privately?, SIAM Journal on Computing 40 (2011) 793-826.
[29] D. Chen, GS-WGAN github-repository, https://github.com/DingfanChen/GS-WGAN, 2020.
[30] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024-8035.
[31] K. Jarrett, K. Kavukcuoglu, M. Ranzato, Y. LeCun, What is the best multi-stage architecture for object recognition?, in: 2009 IEEE 12th international conference on computer vision, IEEE, 2009, pp. 2146-2153.
[32] A. L. Maas, A. Y. Hannun, A. Y. Ng, et al., Rectifier nonlinearities improve neural network acoustic models, in: Proc. icml, volume 30, Citeseer, 2013, p. 3.
[33] The Finnish Medical Society Duodecim, Current Care Guidelines: Treatment of hypertensive crisis, 2020. URL: https://www.kaypahoito.fi/hoi04010.
[34] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R. B. Altman, Missing value estimation methods for DNA microarrays, Bioinformatics 17 (2001) 520-525.
[35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825-2830.
[36] A. Torfi, E. A. Fox, C. K. Reddy, Differentially private synthetic medical data generation using convolutional GANs, Information Sciences 586 (2022) 485-500.
[37] Y. Tao, R. McKenna, M. Hay, A. Machanavajjhala, G. Miklau, Benchmarking differentially private synthetic data generation algorithms, CoRR abs/2112.09238 (2021). URL: https://arxiv.org/abs/2112.09238.
[38] A. P. Bradley, The use of the area under the roc curve in the evaluation of machine learning algorithms, Pattern recognition 30 (1997) 1145-1159.
[39] V. M. Suriyakumar, N. Papernot, A. Goldenberg, M. Ghassemi, Chasing your long tails: Differentially private prediction in health care settings, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021, pp. 723-734.
[40] M. L. Fang, D. S. Dhami, K. Kersting, Dp-ctgan: Differentially private medical data generation using ctgans, in: Artificial Intelligence in Medicine: 20th International Conference on Artificial Intelligence in Medicine, AIME 2022, Halifax, NS, Canada, June 14-17, 2022, Proceedings, Springer, 2022, pp. 178-188.
[41] L. Xu, M. Skoularidou, A. Cuesta-Infante, K. Veeramachaneni, Modeling tabular data using conditional gan, Advances in Neural Information Processing Systems 32 (2019).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>