<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Privacy Amplification for Episodic Training Methods</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vandy Tombs</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivera Kotevska</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steven Young</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Oak Ridge National Laboratory</institution>
          ,
          <addr-line>1 Bethel Valley Road, Oak Ridge, TN 37830</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The publisher acknowledges the US government license to provide public access under the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>It has been shown that differential privacy bounds improve when subsampling is used within a randomized mechanism. Episodic training, utilized by many standard machine learning techniques, uses a multistage subsampling procedure that has not previously been analyzed for privacy bound amplification. In this paper, we focus on improving the calculation of privacy bounds in episodic training by thoroughly analyzing the privacy amplification due to subsampling with a multistage subsampling procedure. The newly developed bound can be incorporated into existing privacy accounting methods.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As more data is utilized by algorithms and machine learning techniques, rigorously maintaining the privacy of this data has become important. Cyber security, health, and census data collection are all examples of fields seeing increased scrutiny over ensuring the privacy of data, and it is well known that simply anonymizing the data by removing features such as names is not sufficient to guarantee privacy, due to vulnerabilities such as re-identification attacks, especially when an adversary has access to auxiliary knowledge or data (see e.g. [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]).
      </p>
      <p>
        Differential privacy, first introduced by Dwork, is one technical definition of privacy that has been studied widely in the literature [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. This definition provides rigorous guarantees for the privacy of data utilized by an algorithm and has several nice properties, such as robustness to post-processing and strong composition theorems.
      </p>
      <p>
        Machine learning practitioners initially integrated differential privacy by naively applying these composition theorems, assuming that the algorithm accessed the entire training set on each step of training. Abadi et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] noticed that the data is subsampled into batches, so only a subset of the data is utilized for each step of training. This allowed for improved privacy bounds; however, they assumed that batches were created using Poisson sampling. Later authors showed improved bounds for batches created using simple random sampling without replacement [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Most recently, Balle et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] provided a fully unified theory for determining the privacy amplification due to subsampling, as well as a complete analysis for Poisson subsampling and simple random sampling, both with and without replacement.
      </p>
      <p>
        The subsampling methods analyzed previously include many of the subsampling methods utilized in machine learning; however, they do not capture batches formed by algorithms that use episodic training. Episodic training methods are utilized by a variety of machine learning algorithms, such as meta-learning (e.g., [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]) or metric learning (e.g., [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]) algorithms. Domain generalization algorithms have also frequently utilized episodic training [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        In this paper, we analyze the privacy amplification due to the subsampling method utilized in an episodic training regime. Specifically, we observe that forming batches in episodic training is a multistage subsampling method, and we provide a complete analysis of the improved differential privacy bounds obtained when applying a mechanism to a sample drawn using multistage subsampling. The resulting theorem can be easily applied to episodic training methods and integrated with privacy accounting methods such as the moments accountant [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This bound can also be utilized by practitioners in other domains that use multistage subsampling within their algorithms.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. Multistage Subsampling</title>
      <p>In a multistage sampling procedure, the universe from
which samples are drawn is partitioned. These partitions
may contain the examples we are ultimately interested
in sampling or may contain one or several levels of
partitions. The subsampling procedure is to sample partitions
at each level until examples are sampled. For example,
if we are interested in the demographics of students at
a school, we could partition students by teacher, sample
some number of teachers and then sample students from
each sampled teacher.</p>
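      <p>To make the two-stage school example concrete, the following is a minimal Python sketch (our illustration; the class sizes and the sample counts n1 = 2 and n2 = 2 are assumptions, not values from the text):</p>
      <preformat>import random

# Hypothetical universe: students partitioned by teacher (two-stage sampling).
universe = {
    "teacher_a": ["s1", "s2", "s3", "s4"],
    "teacher_b": ["s5", "s6"],
    "teacher_c": ["s7", "s8", "s9"],
}

n1, n2 = 2, 2  # sample 2 teachers, then 2 students from each sampled teacher

# Stage 1: sample teachers without replacement.
teachers = random.sample(list(universe), n1)
# Stage 2: sample students without replacement from each sampled teacher.
batch = [s for t in teachers for s in random.sample(universe[t], n2)]
print(batch)</preformat>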
      <p>
        To see that episodic training is a multistage subsampling procedure, consider how training batches are formed in Algorithm 2 of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In this work, a subset of tasks is sampled from a collection of tasks, then examples are sampled from each selected task and provided to the training algorithm. This is a 2-stage sampling procedure, since the training data is partitioned into two levels: tasks and examples. In multistage subsampling, the first level of partitions are called the primary sampling units and the final level is called the ultimate sampling units; this final level contains the examples we are ultimately interested in sampling. For more details on multistage subsampling, see e.g. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Differential Privacy</title>
        <p>
          Since our analysis utilizes the tools of Balle et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we introduce the necessary notations and definitions from it. Let X be an input space equipped with a binary symmetric relation ≃ that describes when two training sets drawn from the universe are adjacent. For our purposes, the relation will be the add-one/remove-one relation; thus two training sets are related if they differ by the addition or removal of one element.
        </p>
        <p>
          Given a randomized algorithm or mechanism ℳ : X → P(Z), where P(Z) is the set of probability measures on the output space Z, ℳ is (ε, δ)-differentially private w.r.t. ≃ if for every pair X ≃ X′ and every measurable subset E ⊆ Z, Pr[ℳ(X) ∈ E] ≤ e^ε Pr[ℳ(X′) ∈ E] + δ.
        </p>
        <p>
          Utilizing the tools from [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] requires expressing differential privacy in terms of the α-divergence D_α(μ ‖ μ′) := sup_E (μ(E) − α·μ′(E)) of two probability measures μ, μ′ ∈ P(Z), where E ranges over all measurable subsets of Z. Differential privacy can then be stated in terms of α-divergence; specifically, a mechanism ℳ is (ε, δ)-differentially private if and only if D_{e^ε}(ℳ(X) ‖ ℳ(X′)) ≤ δ for every pair of adjacent datasets X ≃ X′.
        </p>
        <p>
          We can now define the privacy profile of a mechanism ℳ as δ_ℳ(α) = sup_{X ≃ X′} D_α(ℳ(X) ‖ ℳ(X′)), which associates each privacy parameter α = e^ε with a bound on the α-divergence between the results of the mechanism on two adjacent datasets.
        </p>
        <p>
          Two theorems from [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] are important in our analysis. The first is Advanced Joint Convexity, which we restate in terms of α = e^ε, since we are interested in applying this theorem to improve the privacy bounds due to multistage subsampling.
        </p>
        <p>
          Theorem 1. ([
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], Advanced Joint Convexity of D_α) Let μ, μ′ ∈ P(Z) be measures satisfying μ = (1 − η)μ₀ + ημ₁ and μ′ = (1 − η)μ₀ + ημ′₁ for some η, μ₀, μ₁, μ′₁. Given α &gt; 0, let α′ = 1 + η(α − 1) and θ = α/α′. Then the following holds: D_{α′}(μ ‖ μ′) = η·D_α(μ₁ ‖ (1 − θ)μ₀ + θμ′₁).
        </p>
        <p>
          The final theorem provides the concrete privacy amplification that we need for our analysis. Before presenting it, we need to define when two distributions μ, μ′ ∈ P(X) are d-compatible. Let π be a coupling of μ and μ′, and for (u, u′) ∈ supp(π) define d(u, u′) = d(u, supp(μ′)), where the distance between a point u and supp(μ′) is defined to be the distance between u and the closest point in supp(μ′).
        </p>
        <p>
          Theorem 2. Let Π(μ, μ′) be the set of all couplings between μ and μ′, and for k ≥ 1 let U_k = {u ∈ supp(μ) : d(u, supp(μ′)) = k}. If μ and μ′ are d-compatible, then the following holds (where δ_{ℳ,k} denotes the group privacy profile of ℳ at distance k): min_{π ∈ Π(μ, μ′)} Σ_{(u,u′)} π(u, u′)·δ_{ℳ,d(u,u′)}(ε) = Σ_{k≥1} μ(U_k)·δ_{ℳ,k}(ε).
        </p>
        <p>We are now equipped to begin an analysis of the privacy amplification due to multistage subsampling.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Our Approach: Privacy Bounds for Multistage Sampling Analysis</title>
      <p>We will begin the analysis with an example. Through this example, we will introduce the notation necessary for the general analysis.</p>
      <p>Example 3.1. Let U be a universe of 18 examples from which the database or training data is drawn. Suppose we can categorize the data from the universe at 3 different levels, so we will perform a 3-stage sampling. Let
U = U₁ ∪ U₂
  = (U₁₁ ∪ U₁₂ ∪ U₁₃) ∪ (U₂₁ ∪ U₂₂)
  = ({u₁₁₁, u₁₁₂, u₁₁₃, u₁₁₄} ∪ {u₁₂₁, u₁₂₂} ∪ {u₁₃₁, u₁₃₂, u₁₃₃}) ∪ ({u₂₁₁, u₂₁₂, u₂₁₃, u₂₁₄} ∪ {u₂₂₁, u₂₂₂, u₂₂₃, u₂₂₄, u₂₂₅}).
In this example, the U_i for i ∈ {1, 2} are the primary sampling units, the U_ij are the ultimate sampling units, and the u_ijk are the examples that would be provided to a training algorithm.</p>
      <p>
        In general, let U be a universe from which the training data is drawn and suppose a finite number of levels, L, partition this universe. Define the U_{i1} to be the primary sampling units, and let the U_{i1i2···i(ℓ−1)} be the sampling units of the U_{i1i2···i(ℓ−2)} unit. U_{i1i2···i(L−1)} is an ultimate sampling unit, which contains the examples we are interested in sampling. Note that we require each sampling unit to be of finite size except the ultimate sampling units, which may be infinite. The multistage sampling procedure can be described by Algorithm 1: Multistage Sampling. Most episodic training procedures only use 2- or 3-stage sampling, but we analyze the general case, which may have applications to other scientific domains (e.g. medical domains) where multistage sampling may have more levels.
      </p>
      <sec id="sec-1-1">
        <title>Algorithm 1: Multistage Sampling</title>
        <p>Set   := ⋃︀ 
Set  := ∅
Given  : the number of units to be sampled at
each level (1 ≤  ≤ )
for  ∈ {1, ...., } do
for  ∈ PrevLevel do
sample without replacement  elements
from 
add sampled elements to 
end
end
  = 
 := ∅</p>
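      <p>A minimal runnable rendering of Algorithm 1 in Python (ours, not code from the paper), applied to the Example 3.1 universe encoded as nested lists:</p>
      <preformat>import random

def multistage_sample(universe, counts):
    """Algorithm 1: at each level, sample counts[l] sub-units without
    replacement from every unit sampled at the previous level."""
    prev_level = [universe]   # the whole universe acts as the single level-0 unit
    for n in counts:          # one pass per level l = 1, ..., L
        sample = []
        for unit in prev_level:
            sample.extend(random.sample(unit, n))
        prev_level = sample   # sampled sub-units become the next level's units
    return prev_level         # after the last level: the batch of examples

# Example 3.1 universe: L = 3 levels, 18 examples u_ijk encoded as strings.
U = [  # primary units U_1 and U_2
    [["u111", "u112", "u113", "u114"], ["u121", "u122"], ["u131", "u132", "u133"]],
    [["u211", "u212", "u213", "u214"], ["u221", "u222", "u223", "u224", "u225"]],
]

random.seed(0)
print(multistage_sample(U, counts=[1, 2, 2]))  # n_1=1, n_2=2, n_3=2 (illustrative)</preformat>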
      <p>Now, let X ⊂ U be the training data or database we are analyzing. We will require that the training data have at least one element from each sampling unit described above. Thus we only allow the ultimate sampling units of the training data, X_{i1i2···i(L−1)} ⊂ U_{i1i2···i(L−1)}, to be non-empty finite subsets of the ultimate sampling units with at least n_L elements (i.e. at least the number of units that will be sampled from the ultimate sampling units). All other sampling units defined for the universe will remain the same for the training set.</p>
      <p>
        We want to analyze the privacy bound on algorithms that use a multistage subsampling procedure on X. To do this, we will apply the theorems from [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and will analyze this sampling procedure under the add-one/remove-one relation. We begin by defining a probability measure for this sampling procedure. We can do this by simply defining P(u_{i1i2···iL}) = (∏_{ℓ=1}^{L} n_ℓ) / (|X_{i1}| |X_{i1i2}| ··· |X_{i1i2···i(L−1)}|), where u_{i1i2···iL} is in the ultimate unit X_{i1i2···i(L−1)}.
      </p>
      <p>
        Now consider X′ created by removing one element from X, say, without loss of generality, u_{i1i2···i(L−1)1} for some index i1, i2, ..., i(L−1). The probability measure P′ for sampling from X′ can be defined similarly to the above. We wish to compute the total variational distance between these two measures so that we can apply the Advanced Joint Convexity and coupling theorems from [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We just need to compute TV(P, P′) = 1 − Σ_{u ∈ X} min(P(u), P′(u)).
      </p>
      <p>Note we can easily extend our probability measures P, P′ to the entire universe by setting the inclusion probability to 0 for any element not in X or X′. For all elements u ∈ X′ ∖ X_{i1i2···i(L−1)}, we have min(P(u), P′(u)) = P(u) = P′(u). Since u_{i1i2···i(L−1)1} ∉ X′, we also have min(P(u_{i1i2···i(L−1)1}), P′(u_{i1i2···i(L−1)1})) = 0. So we just need to consider the elements of the ultimate unit from which we removed an element. Since we removed an element from this unit, we have P′(u) &gt; P(u): X′_{i1i2···i(L−1)} (the ultimate unit missing an element in X′) has fewer elements than X_{i1i2···i(L−1)}. Therefore, for all u_{i1i2···i(L−1)j} ∈ X′_{i1i2···i(L−1)} with j ≠ 1, we have P(u_{i1i2···i(L−1)j}) &lt; P′(u_{i1i2···i(L−1)j}), where P(u_{i1i2···i(L−1)j}) = (∏_{ℓ=1}^{L} n_ℓ)/(|X_{i1}| |X_{i1i2}| ··· |X_{i1i2···i(L−1)}|) and P′(u_{i1i2···i(L−1)j}) = (∏_{ℓ=1}^{L} n_ℓ)/(|X_{i1}| |X_{i1i2}| ··· |X′_{i1i2···i(L−1)}|). Thus Σ_{u ∈ X} min(P(u), P′(u)) = Σ_{u ∈ X′} P(u) = 1 − P(u_{i1i2···i(L−1)1}). Hence the total variational distance is just the inclusion probability of the element we removed. Determining the total variational distance when adding an element from U to X is similar to the above argument.</p>
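      <p>A toy numerical check (ours, for a single ultimate unit of four equally likely elements) confirms that the total variational distance equals the inclusion probability of the removed element:</p>
      <preformat># One ultimate unit {a, b, c, d}; P' results from removing element "a".
P       = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
P_prime = {"b": 1 / 3, "c": 1 / 3, "d": 1 / 3}

support = set(P) | set(P_prime)
tv = 1 - sum(min(P.get(u, 0.0), P_prime.get(u, 0.0)) for u in support)
print(tv, P["a"])  # both 0.25: TV(P, P') equals P(removed element)</preformat>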
        <p>We can now provide an amplified privacy bound for
multistage subsampling.</p>
      <p>Theorem 3. Let ℳ′ = ℳ ∘ S be a subsampled mechanism on X, where S is the sampling procedure described by Algorithm 1, and let i1i2···i(L−1) be the index of the penultimate sampling unit that satisfies min_{i1,i2,...,i(L−1)} (|X_{i1}| |X_{i1i2}| ··· |X_{i1i2···i(L−1)}|). Then, for any ε ≥ 0, we have that δ_{ℳ′}(ε′) ≤ η·δ_ℳ(ε) for η = (∏_{ℓ=1}^{L} n_ℓ)/(|X_{i1}| |X_{i1i2}| ··· |X_{i1i2···i(L−1)}|) and ε′ = log(1 + η(e^ε − 1)), under the add-one/remove-one relation.</p>
      <p>
        To fully complete the proof, let X, X′ be training sets drawn from U with X ≃ X′ under the add-one/remove-one relation ≃, let S(X) denote the subsampling mechanism described by Algorithm 1, and take η = TV(P, P′). Let X₀ = X ∩ X′; then by definition of ≃, X₀ = X or X₀ = X′. Let μ₀ = S(X₀), μ = S(X) and μ′ = S(X′). Then the decompositions of μ and μ′ induced by their maximal coupling have μ₁ = μ₀ when X₀ = X, or μ′₁ = μ₀ when X₀ = X′. We only need to consider X₀ = X′, since this is when the maximum is obtained in applying Advanced Joint Convexity. Finally, we note that one can easily create a ≃-compatible pair according to the definition provided in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] by first sampling Y from X and building Y′ by adding u (which may be empty) to Y. Thus for each dataset pair, by Theorem 7 of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we have δ_{ℳ′}(ε′) ≤ η·δ_ℳ(ε). In order to get a bound over all possible training set pairs, we need to take η = max_{(X,X′) : X ≃ X′} TV(P, P′). This occurs exactly when we remove an element from the penultimate unit with index i1i2···i(L−1), which completes the proof.
      </p>
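      <p>The amplified parameters in Theorem 3 are straightforward to evaluate in code (our sketch; the per-level counts and unit sizes are illustrative stand-ins for the minimizing path):</p>
      <preformat>import math
from math import prod

def amplified_eps(eps, eta):
    """epsilon' = log(1 + eta * (exp(eps) - 1)) from Theorem 3."""
    return math.log(1 + eta * (math.exp(eps) - 1))

n = [1, 2, 2]          # n_1, n_2, n_3: units sampled at each level
unit_sizes = [3, 2]    # |X_{i1}|, |X_{i1 i2}| along the minimizing path
eta = prod(n) / prod(unit_sizes)

for eps in [0.5, 1.0, 2.0]:
    print(eps, amplified_eps(eps, eta))  # amplified eps' stays below eps</preformat>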
      <p>
        We briefly mention how one might incorporate this new bound into a privacy accounting method. Many accounting methods, like the moments accountant [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], use the moment generating function in conjunction with the Gaussian mechanism to calculate the privacy bounds while a machine learning algorithm is training. Using Theorem 4 from [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] with our new bound, one can easily derive a subsampled Gaussian mechanism that can be utilized in algorithms like those described in [
        <xref ref-type="bibr" rid="ref5 ref14">5, 14</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>4. Conclusion</title>
      <p>This paper completely analyzes the privacy amplification
due to multistage subsampling. This provides the correct
privacy bounds for any algorithm that utilizes multistage
subsampling, such as machine learning algorithms that
use episodic training. Our future goal is to perform experiments to better understand privacy in machine learning algorithms that use episodic training, such as meta-learning algorithms. We hope our presented approach and discussion will prove useful to other researchers wanting to apply privacy bounds to multistage sampling in other studies and applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shmatikov</surname>
          </string-name>
          ,
          <article-title>Robust deanonymization of large sparse datasets</article-title>
          ,
          <source>in: 2008 IEEE Symposium on Security and Privacy</source>
          (sp
          <year>2008</year>
          ),
          <year>2008</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>125</lpage>
          . doi:
          <volume>10</volume>
          .1109/SP.
          <year>2008</year>
          .
          <volume>33</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rocher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hendrickx</surname>
          </string-name>
          , Y.-A. de Montjoye,
          <article-title>Estimating the success of reidentifications in incomplete datasets using generative models 10 (????) 3069</article-title>
          . URL: https://doi.org/10.1038/s41467-019-10933-3. doi:
          <volume>10</volume>
          .1038/s41467-019-10933-3.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>The algorithmic foundations of diferential privacy</article-title>
          ,
          <source>Found. Trends Theor. Comput. Sci. 9</source>
          (
          <year>2014</year>
          )
          <fpage>211</fpage>
          -
          <lpage>407</lpage>
          . URL: https://doi.org/10.1561/ 0400000042. doi:
          <volume>10</volume>
          .1561/0400000042.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <article-title>Diferential privacy: A survey of results</article-title>
          , in: M.
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Li (Eds.),
          <source>Theory and Applications of Models of Computation</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2008</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chu</surname>
          </string-name>
          , I. Goodfellow, H. B.
          <string-name>
            <surname>McMahan</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Mironov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Talwar</surname>
            ,
            <given-names>L. Zhang,</given-names>
          </string-name>
          <article-title>Deep learning with diferential privacy</article-title>
          ,
          <source>in: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security</source>
          , CCS '16,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2016</year>
          , p.
          <fpage>308</fpage>
          -
          <lpage>318</lpage>
          . URL: https://doi.org/10.1145/2976749. 2978318. doi:
          <volume>10</volume>
          .1145/2976749.2978318.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.-X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Balle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kasiviswanathan</surname>
          </string-name>
          ,
          <article-title>Subsampled rényi diferential privacy and analytical moments accountant</article-title>
          ,
          <source>Journal of Privacy and Confidentiality</source>
          <volume>10</volume>
          (
          <year>2021</year>
          ). URL: https: //journalprivacyconfidentiality.org/index.php/jpc/ article/view/723. doi:
          <volume>10</volume>
          .29012/jpc.723.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Balle</surname>
          </string-name>
          , G. Barthe,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaboardi</surname>
          </string-name>
          ,
          <article-title>Privacy amplification by subsampling: Tight analyses via couplings and divergences</article-title>
          ,
          <source>in: Proceedings of the 32nd International Conference on Neural Information Processing Systems</source>
          , NIPS'18, Curran Associates Inc.,
          <string-name>
            <surname>Red</surname>
            <given-names>Hook</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA,
          <year>2018</year>
          , p.
          <fpage>6280</fpage>
          -
          <lpage>6290</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Finn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <article-title>Model-agnostic metalearning for fast adaptation of deep networks</article-title>
          , in: D.
          <string-name>
            <surname>Precup</surname>
            ,
            <given-names>Y. W.</given-names>
          </string-name>
          <string-name>
            <surname>Teh</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 34th International Conference on Machine Learning</source>
          , volume
          <volume>70</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1126</fpage>
          -
          <lpage>1135</lpage>
          . URL: https://proceedings.mlr.press/v70/finn17a.html.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ravi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <article-title>Optimization as a model for few-shot learning</article-title>
          ,
          <source>in: ICLR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Blundell</surname>
          </string-name>
          , T. Lillicrap, k. kavukcuoglu, D. Wierstra,
          <article-title>Matching networks for one shot learning</article-title>
          , in: D.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Sugiyama</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          <string-name>
            <surname>Luxburg</surname>
            ,
            <given-names>I. Guyon</given-names>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>29</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2016</year>
          . URL: https://proceedings.neurips.cc/paper/2016/file/ 90e1357833654983612fb05e3ec9148c-Paper.pdf .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Snell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Swersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <article-title>Prototypical networks for few-shot learning</article-title>
          , in: I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          <string-name>
            <surname>Curran</surname>
            <given-names>Associates</given-names>
          </string-name>
          , Inc.,
          <year>2017</year>
          . URL: https://proceedings.neurips.cc/paper/2017/file/ cb8da6767461f2812ae4290eac7cbc42-Paper.pdf .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          , Y.-
          <string-name>
            <given-names>Z.</given-names>
            <surname>Song</surname>
          </string-name>
          , T. Hospedales,
          <article-title>Episodic training for domain generalization</article-title>
          ,
          <source>in: 2019 IEEE/CVF International Conference on Computer Vision</source>
          (ICCV),
          <year>2019</year>
          , pp.
          <fpage>1446</fpage>
          -
          <lpage>1455</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICCV.
          <year>2019</year>
          .
          <volume>00153</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>