=Paper=
{{Paper
|id=Vol-3318/short3
|storemode=property
|title=Privacy Amplification for Episodic Training Methods
|pdfUrl=https://ceur-ws.org/Vol-3318/short3.pdf
|volume=Vol-3318
|authors=Vandy Tombs,Olivera Kotevska,Steven Young
|dblpUrl=https://dblp.org/rec/conf/cikm/TombsKY22
}}
==Privacy Amplification for Episodic Training Methods==
Vandy Tombs, Olivera Kotevska and Steven Young
Oak Ridge National Laboratory, 1 Bethel Valley Road, Oak Ridge, TN 37830

This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The publisher acknowledges the US government license to provide public access under the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy.

Abstract

It has been shown that differential privacy bounds improve when subsampling within a randomized mechanism. Episodic training, utilized in many standard machine learning techniques, uses a multistage subsampling procedure that has not previously been analyzed for privacy bound amplification. In this paper, we focus on improving the calculation of privacy bounds in episodic training by thoroughly analyzing the privacy amplification due to subsampling under a multistage subsampling procedure. The newly developed bound can be incorporated into existing privacy accounting methods.

1. Introduction

As more data is utilized by algorithms and machine learning techniques, rigorously maintaining the privacy of this data has become important. Cyber security, health, and census data collection are all examples of fields seeing increased scrutiny over ensuring the privacy of data, and it is well known that merely anonymizing data by removing features such as names is not sufficient to guarantee privacy, owing to vulnerabilities such as re-identification attacks, especially when an adversary has access to auxiliary knowledge or data (see e.g. [1, 2]).

Differential privacy, first introduced by Dwork, is one technical definition of privacy that has been studied widely in the literature [3, 4]. This definition provides rigorous guarantees for the privacy of data utilized by an algorithm and has several desirable properties, such as robustness to post-processing and strong composition theorems.

Machine learning practitioners initially integrated differential privacy by naively applying these composition theorems, assuming that the algorithm accessed the entire training set on each step of training. Abadi et al. [5] observed that the data is subsampled into batches, so only a subset of the data is utilized at each step of training. This allowed for improved privacy bounds; however, they assumed that batches were created using Poisson sampling. Later authors showed improved bounds for batches created using simple random sampling without replacement [6]. Most recently, Balle et al. [7] provided a fully unified theory for determining the privacy amplification due to subsampling, together with a complete analysis of Poisson subsampling and of simple random sampling both with and without replacement.

The subsampling methods analyzed previously cover many of the subsampling methods utilized in machine learning; however, they do not capture the batches formed by algorithms that use episodic training. Episodic training methods are utilized by a variety of machine learning algorithms, such as meta-learning (e.g., [8, 9]) and metric learning (e.g., [10, 11]) algorithms. Domain generalization algorithms have also frequently utilized episodic training [12].

In this paper, we analyze the privacy amplification due to the subsampling method utilized in an episodic training regime. Specifically, we observe that forming batches in episodic training is a multistage subsampling method, and we provide a complete analysis of the improved differential privacy bounds obtained when a mechanism is applied to a sample drawn using multistage subsampling. The resulting theorem can easily be applied to episodic training methods and integrated with privacy accounting methods such as the moments accountant [5]. The bound can also be utilized by practitioners in other domains that use multistage subsampling within their algorithms.
2. Background and Related Works

2.1. Multistage Subsampling

In a multistage sampling procedure, the universe from which samples are drawn is partitioned. These partitions may contain the examples we are ultimately interested in sampling, or they may contain one or several further levels of partitions. The subsampling procedure samples partitions at each level until examples are sampled. For example, if we are interested in the demographics of students at a school, we could partition students by teacher, sample some number of teachers, and then sample students from each sampled teacher.

To see that episodic training is a multistage subsampling procedure, consider how training batches are formed in Algorithm 2 of [8]. In that work, a subset of tasks is sampled from a collection of tasks, and then examples are sampled from each selected task and provided to the training algorithm. This is a 2-stage sampling procedure since the training data is partitioned into two levels: tasks and examples. In multistage subsampling, the first level of partitions are called the primary sampling units and the final level the ultimate sampling units; this final level contains the examples we are ultimately interested in sampling. For more details on multistage subsampling, see e.g. [13].

2.2. Differential Privacy

Since our analysis utilizes the tools of Balle et al. [7], we introduce the necessary notation and definitions from that work. Let $\mathcal{U}$ be an input space equipped with a binary symmetric relation $\simeq_{\mathcal{U}}$ that describes the concept of neighboring inputs. For our purposes, $\mathcal{U}$ is a universe from which the training data is drawn, and the relation is the add-one/remove-one relation: two training sets are related if they differ by the addition or removal of one element.

Given a randomized algorithm or mechanism $\mathcal{M}$ that maps an input dataset to an element of $\mathcal{P}(Z)$, the set of probability measures on the output space $Z$, we say $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private w.r.t. $\simeq_{\mathcal{U}}$ if for every pair $\mathcal{T} \simeq_{\mathcal{U}} \mathcal{T}'$ and every measurable subset $E \subseteq Z$,

$$\Pr[\mathcal{M}(\mathcal{T}) \in E] \leq e^{\epsilon} \Pr[\mathcal{M}(\mathcal{T}') \in E] + \delta.$$

Utilizing the tools from [7] requires expressing differential privacy in terms of the $\alpha$-divergence

$$D_{\alpha}(\mu \| \mu') := \sup_{E} \big( \mu(E) - \alpha\, \mu'(E) \big)$$

of two probability measures $\mu, \mu' \in \mathcal{P}(Z)$, where $E$ ranges over all measurable subsets of $Z$. Differential privacy can then be stated in terms of the $\alpha$-divergence; specifically, a mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private if and only if $D_{e^{\epsilon}}(\mathcal{M}(\mathcal{T}) \| \mathcal{M}(\mathcal{T}')) \leq \delta$ for every pair of adjacent datasets $\mathcal{T} \simeq_{\mathcal{U}} \mathcal{T}'$.

We can now define the privacy profile of a mechanism $\mathcal{M}$ as $\delta_{\mathcal{M}}(\epsilon) = \sup_{\mathcal{T} \simeq_{\mathcal{U}} \mathcal{T}'} D_{e^{\epsilon}}(\mathcal{M}(\mathcal{T}) \| \mathcal{M}(\mathcal{T}'))$, which associates with each privacy parameter $\alpha = e^{\epsilon}$ a bound on the $\alpha$-divergence between the results of the mechanism on two adjacent datasets.

Two theorems from [7] are important in our analysis. The first is Advanced Joint Convexity, which we restate in terms of $\alpha = e^{\epsilon}$ since we are interested in applying this theorem to improve the privacy bounds due to multistage subsampling.

Theorem 1. ([7], Advanced Joint Convexity of $D_{\alpha}$) Let $\mu, \mu' \in \mathcal{P}(Z)$ be measures satisfying $\mu = (1-\eta)\mu_0 + \eta\mu_1$ and $\mu' = (1-\eta)\mu_0 + \eta\mu_1'$ for some $\eta, \mu_0, \mu_1, \mu_1'$. Given $\epsilon > 0$, let $\epsilon' = \log(1 + \eta(e^{\epsilon} - 1))$ and $\beta = e^{\epsilon'}/e^{\epsilon}$. Then the following holds:

$$D_{e^{\epsilon'}}(\mu \| \mu') = \eta\, D_{e^{\epsilon}}\big(\mu_1 \,\|\, (1-\beta)\mu_0 + \beta\mu_1'\big).$$

The final theorem provides the concrete privacy amplification that we need for our analysis. Before presenting it, we need to define when two distributions $\mu, \mu' \in \mathcal{P}(Z)$ are $d_{\simeq}$-compatible. Let $\pi$ be a coupling of $\mu, \mu'$ and define $d_{\simeq}(y, y') = d_{\simeq}(y, \mathrm{supp}(\mu'))$ for $(y, y') \in \mathrm{supp}(\pi)$, where the distance between a point $y$ and $\mathrm{supp}(\mu')$ is defined to be the distance between $y$ and the closest point in $\mathrm{supp}(\mu')$.

Theorem 2. Let $C(\mu, \mu')$ be the set of all couplings between $\mu$ and $\mu'$, and for $i \geq 1$ let $\Delta_i = \{y \in \mathrm{supp}(\mu) : d_{\simeq}(y, \mathrm{supp}(\mu')) = i\}$. If $\mu$ and $\mu'$ are $d_{\simeq}$-compatible, then the following holds:

$$\min_{\pi \in C(\mu, \mu')} \sum_{y, y'} \pi_{y, y'}\, \delta_{\mathcal{M}, d_{\simeq}(y, y')}(\epsilon) = \sum_{i \geq 1} \mu(\Delta_i)\, \delta_{\mathcal{M}, i}(\epsilon),$$

where $\delta_{\mathcal{M}, i}$ denotes the privacy profile of $\mathcal{M}$ for inputs at distance $i$ (the group privacy profile).

We are now equipped to begin an analysis of the privacy amplification due to multistage subsampling.
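To make the amplification in Theorem 1 concrete before proceeding, the following minimal Python sketch (ours, not from the paper) evaluates the amplified parameter $\epsilon' = \log(1 + \eta(e^{\epsilon} - 1))$ for a base guarantee $\epsilon = 1$ at several mixing weights $\eta$:

    import math

    def amplified_epsilon(eps: float, eta: float) -> float:
        """eps' = log(1 + eta * (exp(eps) - 1)), the amplified privacy
        parameter from advanced joint convexity (Theorem 1 / [7])."""
        return math.log(1.0 + eta * (math.exp(eps) - 1.0))

    # Smaller mixing weights eta give smaller (stronger) eps'.
    for eta in (1.0, 0.5, 0.1, 0.01):
        print(f"eta = {eta:5.2f} -> eps' = {amplified_epsilon(1.0, eta):.4f}")

For small $\eta$, $\epsilon' \approx \eta(e^{\epsilon} - 1)$, recovering the familiar rule of thumb that subsampling scales the privacy parameter roughly linearly in the sampling rate.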
3. Our Approach: Privacy Bounds for Multistage Sampling Analysis

We begin the analysis with an example; through it, we introduce the notation needed for the general analysis.

Example 3.1. Let $\mathcal{U}$ be a universe of 18 examples from which the database or training data is drawn. Suppose we can categorize the data of the universe at 3 different levels, so we will perform a 3-stage sampling. Let

$$\begin{aligned}
\mathcal{U} &= S_1 \cup S_2 \\
&= (S_{11} \cup S_{12} \cup S_{13}) \cup (S_{21} \cup S_{22}) \\
&= \big( \{u_{111}, u_{112}, u_{113}, u_{114}\} \cup \{u_{121}, u_{122}\} \cup \{u_{131}, u_{132}, u_{133}\} \big) \\
&\quad \cup \big( \{u_{211}, u_{212}, u_{213}, u_{214}\} \cup \{u_{221}, u_{222}, u_{223}, u_{224}, u_{225}\} \big).
\end{aligned}$$

In this example, the $S_{i_1}$ for $i_1 \in \{1, 2\}$ are the primary sampling units, the $S_{i_1 i_2}$ are the ultimate sampling units, and the $u_{i_1 i_2 i_3}$ are the examples that would be provided to a training algorithm.

In general, let $\mathcal{U}$ be a universe from which the training data is drawn and suppose a finite number of levels, $n_L$, partition this universe. Define the $S_{i_1}$ to be the primary sampling units and let the $S_{i_1 i_2 \cdots i_{n_L-1}}$ be the sampling units of the $S_{i_1 i_2 \cdots i_{n_L-2}}$ unit. $S_{i_1 i_2 \cdots i_{n_L-1}}$ is an ultimate sampling unit, which contains the examples we are interested in sampling. Note that we require each sampling unit to be of finite size except the ultimate sampling units, which may be infinite. The multistage sampling procedure is described by Algorithm 1 below. Most episodic training procedures use only 2- or 3-stage sampling, but we analyze the general case, which may have applications in other scientific domains (e.g. medical domains) where multistage sampling may have more levels.

Algorithm 1: Multistage Sampling

    Given $n_i$: the number of units to be sampled at each level ($1 \leq i \leq n_L$)
    Set PrevLevel := $\{\mathcal{U}\}$
    Set CurrentLevel := $\emptyset$
    for $i \in \{1, \ldots, n_L\}$ do
        for $S \in$ PrevLevel do
            sample $n_i$ elements without replacement from $S$
            add the sampled elements to CurrentLevel
        end
        PrevLevel := CurrentLevel
        CurrentLevel := $\emptyset$
    end
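The pseudocode above translates directly into the following runnable Python sketch (our rendering; the nested-dictionary encoding of the universe and the function name are illustrative assumptions, not the paper's notation). Stage $i$ draws $n_i$ sub-units without replacement from every unit retained at the previous stage:

    import random

    def multistage_sample(universe, counts, rng=random):
        """Algorithm 1 (sketch): `universe` is a nested dict whose leaves
        are lists of examples; counts gives n_1, ..., n_{n_L}, the number
        of units drawn without replacement at each level."""
        prev_level = [universe]          # PrevLevel := {U}
        for n_i in counts:
            current_level = []
            for unit in prev_level:
                children = list(unit.values()) if isinstance(unit, dict) else unit
                current_level.extend(rng.sample(children, n_i))
            prev_level = current_level   # descend one level
        return prev_level                # the sampled ultimate-level examples

    # One possible encoding of the 18-example, 3-level universe of Example 3.1.
    U = {"S1": {"S11": ["u111", "u112", "u113", "u114"],
                "S12": ["u121", "u122"],
                "S13": ["u131", "u132", "u133"]},
         "S2": {"S21": ["u211", "u212", "u213", "u214"],
                "S22": ["u221", "u222", "u223", "u224", "u225"]}}

    # One episode: 1 primary unit, then 2 ultimate units, then 2 examples each.
    print(multistage_sample(U, counts=[1, 2, 2]))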
Now, let $\mathcal{T} \subseteq \mathcal{U}$ be the training data or database we are analyzing. We require that the training data contain at least one element from each sampling unit described above. Thus the ultimate sampling units of the training data, $\mathcal{T}_{i_1 i_2 \cdots i_{n_L-1}} \subseteq S_{i_1 i_2 \cdots i_{n_L-1}}$, must be non-empty finite subsets of the ultimate sampling units with at least $n_{n_L}$ elements (i.e. at least the number of units that will be sampled from the ultimate sampling units). All other sampling units defined for the universe $\mathcal{U}$ remain the same for the training set.

We want to analyze the privacy bound of algorithms that use a multistage subsampling procedure on $\mathcal{T}$. To do this, we apply the theorems from [7] and analyze this sampling procedure under the add-one/remove-one relation. We begin by defining a probability measure for this sampling procedure, simply by setting

$$\mu(t_{i_1 i_2 \cdots i_{n_L}}) = \frac{\prod_{i=1}^{n_L} n_i}{|S_{i_1}|\, |S_{i_1 i_2}| \cdots |S_{i_1 i_2 \cdots i_{n_L-1}}|}$$

where $t_{i_1 i_2 \cdots i_{n_L}}$ is in the ultimate unit $S_{i_1 i_2 \cdots i_{n_L-1}}$.

Now consider $\mathcal{T}'$ created by removing one element from $\mathcal{T}$, say, without loss of generality, $t_{i_1 i_2 \cdots i_{n_L-1} 1}$ for some $i_1, i_2, \ldots, i_{n_L-1}$. The probability measure $\mu'$ for sampling from $\mathcal{T}'$ can be defined similarly. We wish to compute the total variation distance between these two measures so that we can apply the Advanced Coupling Theorem from [7]. We just need to compute

$$d_{TV}(\mu, \mu') = 1 - \sum_{u} \min\big(\mu(u), \mu'(u)\big).$$

Note we can easily extend our probability measures $\mu, \mu'$ to the entire universe by setting the inclusion probability to 0 for any element not in $\mathcal{T}$ or $\mathcal{T}'$. For all elements $t \in \mathcal{T}' \setminus S_{i_1 i_2 \cdots i_{n_L-1}}$, we have $\min(\mu(t), \mu'(t)) = \mu(t) = \mu'(t)$. Since $t_{i_1 i_2 \cdots i_{n_L-1} 1} \notin \mathcal{T}'$, we also have $\min(\mu(t_{i_1 i_2 \cdots i_{n_L-1} 1}), \mu'(t_{i_1 i_2 \cdots i_{n_L-1} 1})) = 0$. So we just need to consider the elements of the ultimate unit from which we removed an element. Since we removed an element from this unit, we have $\mu'(t) > \mu(t)$, because $\mathcal{T}'_{i_1 i_2 \cdots i_{n_L-1}}$ (the ultimate unit missing an element in $\mathcal{T}'$) has fewer elements than $\mathcal{T}_{i_1 i_2 \cdots i_{n_L-1}}$. Therefore, for all $t_{i_1 i_2 \cdots i_{n_L-1} j} \in \mathcal{T}'_{i_1 i_2 \cdots i_{n_L-1}}$ with $j \neq 1$, we have $\mu(t_{i_1 i_2 \cdots i_{n_L-1} j}) < \mu'(t_{i_1 i_2 \cdots i_{n_L-1} j})$, where

$$\mu(t_{i_1 i_2 \cdots i_{n_L-1} j}) = \frac{\prod_{i=1}^{n_L} n_i}{|S_{i_1}|\, |S_{i_1 i_2}| \cdots |S_{i_1 i_2 \cdots i_{n_L-1}}|}, \qquad
\mu'(t_{i_1 i_2 \cdots i_{n_L-1} j}) = \frac{\prod_{i=1}^{n_L} n_i}{|S_{i_1}|\, |S_{i_1 i_2}| \cdots |S'_{i_1 i_2 \cdots i_{n_L-1}}|}.$$

Thus

$$\sum_{u \in \mathcal{U}} \min\big(\mu(u), \mu'(u)\big) = \sum_{t \in \mathcal{T}'} \mu(t) = 1 - \mu(t_{i_1 i_2 \cdots i_{n_L-1} 1}).$$

Hence the total variation distance is just the inclusion probability of the element we removed. Determining the total variation distance when adding an element from $\mathcal{U}$ to $\mathcal{T}$ follows by a similar argument.

We can now provide an amplified privacy bound for multistage subsampling.

Theorem 3. Let $\mathcal{M}'$ be the subsampled mechanism obtained by applying a base mechanism $\mathcal{M}$ to a sample of $\mathcal{T}$ drawn as described by Algorithm 1, and let $j_1 j_2 \ldots j_{n_L-1}$ be the index of the penultimate sampling unit that attains

$$\min_{i_1, i_2, \ldots, i_{n_L-1}} \big( |S_{i_1}|\, |S_{i_1 i_2}| \cdots |S_{i_1 i_2 \cdots i_{n_L-1}}| \big).$$

Then, for any $\epsilon \geq 0$, we have $\delta_{\mathcal{M}'}(\epsilon') \leq \eta\, \delta_{\mathcal{M}}(\epsilon)$ for

$$\eta = \frac{\prod_{i=1}^{n_L} n_i}{|S_{j_1}|\, |S_{j_1 j_2}| \cdots |S_{j_1 j_2 \cdots j_{n_L-1}}|}$$

and $\epsilon' = \log(1 + \eta(e^{\epsilon} - 1))$ under the add-one/remove-one relation.
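Before completing the proof, here is a small numeric illustration of Theorem 3 (our sketch; the per-path unit sizes are read off Example 3.1 and the base $\epsilon$ is hypothetical). Per the derivation above, $\eta$ is the largest inclusion probability of any single element, attained along the path minimizing the product of unit sizes:

    import math

    def inclusion_probability(counts, sizes):
        """Product over stages of n_i / (units available at that stage
        along the element's path): the probability that Algorithm 1
        includes a fixed element, per the derivation above."""
        p = 1.0
        for n_i, size in zip(counts, sizes):
            p *= n_i / size
        return p

    # Example 3.1 with sampling counts n = (1, 2, 2). Each path lists
    # (primary units, sub-units of S_{i1}, examples in S_{i1 i2}).
    paths = {"S11": (2, 3, 4), "S12": (2, 3, 2), "S13": (2, 3, 3),
             "S21": (2, 2, 4), "S22": (2, 2, 5)}
    counts = (1, 2, 2)

    eta = max(inclusion_probability(counts, s) for s in paths.values())
    eps = 1.0
    eps_amplified = math.log(1.0 + eta * (math.exp(eps) - 1.0))
    print(f"eta = {eta:.4f}, eps' = {eps_amplified:.4f}")

The worst case is the path through $S_{12}$, the smallest ultimate unit, giving $\eta = 1/3$ and $\epsilon' \approx 0.45$ for $\epsilon = 1$.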
To fully complete the proof, let $\mathcal{T}, \mathcal{T}'$ be training sets drawn from $\mathcal{U}$ with $\mathcal{T} \simeq \mathcal{T}'$ under the add-one/remove-one relation $\simeq$, and let $S_{\gamma}(\mathcal{T})$ denote the subsampling mechanism described by Algorithm 1 for $\gamma = d_{TV}(\mu, \mu')$. Let $\mathcal{T}_0 = \mathcal{T} \cap \mathcal{T}'$; then by definition of $\simeq$, $\mathcal{T}_0 = \mathcal{T}$ or $\mathcal{T}_0 = \mathcal{T}'$. Let $\mu_0 = S_{\gamma}(\mathcal{T}_0)$, $\mu = S_{\gamma}(\mathcal{T})$, and $\mu' = S_{\gamma}(\mathcal{T}')$. Then the decompositions of $\mu$ and $\mu'$ induced by their maximal coupling satisfy $\mu_1 = \mu_0$ when $\mathcal{T}_0 = \mathcal{T}$, or $\mu_1' = \mu_0$ when $\mathcal{T}_0 = \mathcal{T}'$. We only need to consider $\mathcal{T}_0 = \mathcal{T}'$, since this is when the maximum is obtained in applying advanced joint convexity. Finally, we note that one can easily create a $d_{\simeq}$-compatible pair according to the definition provided in [7] by first sampling $y$ from $\mu$ and building $y'$ by adding $v$ (which may be empty) to $y$. Thus, for each dataset pair, by Theorem 7 of [7] we have $\delta_{\mathcal{M}'}(\epsilon') \leq \gamma\, \delta_{\mathcal{M}}(\epsilon)$. To obtain a bound over all possible training set pairs, we take $\eta = \max_{(\mathcal{T}, \mathcal{T}')} \gamma_{\mathcal{T} \simeq \mathcal{T}'}$. This maximum occurs exactly when we remove an element from the penultimate unit with index $j_1 j_2 \cdots j_{n_L-1}$, which completes the proof.

We briefly mention how one might incorporate this new bound into a privacy accounting method. Many accounting methods, like the moments accountant [5], use the moment generating function in conjunction with the Gaussian mechanism to calculate privacy bounds while a machine learning algorithm is training. Using Theorem 4 of [7] together with our new bound, one can easily derive a subsampled Gaussian mechanism that can be utilized in algorithms like those described in [5, 14].
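As one way of seeing what such accounting looks like end to end (our illustration with hypothetical step parameters; we use loose basic composition rather than the tighter moments-accountant analysis of [5, 14]), each training step first applies the Theorem 3 amplification and the per-step guarantees are then composed across steps:

    import math

    def amplified_step(eps, delta, eta):
        """Theorem 3: a base (eps, delta) step, subsampled with worst-case
        inclusion probability eta, satisfies (eps', eta * delta)."""
        eps_p = math.log(1.0 + eta * (math.exp(eps) - 1.0))
        return eps_p, eta * delta

    def basic_composition(eps_step, delta_step, steps):
        """Always-valid (but loose) composition; an RDP/moments accountant
        would track the subsampled Gaussian much more tightly."""
        return steps * eps_step, steps * delta_step

    # Hypothetical run: Gaussian-mechanism steps at (2.0, 1e-6), multistage
    # inclusion probability eta = 0.01, 1000 training steps.
    eps_p, delta_p = amplified_step(2.0, 1e-6, 0.01)
    print(basic_composition(eps_p, delta_p, 1000))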
4. Conclusion

This paper completely analyzes the privacy amplification due to multistage subsampling. This provides the correct privacy bounds for any algorithm that utilizes multistage subsampling, such as machine learning algorithms that use episodic training. Our future goal is to perform experiments to better understand privacy in machine learning algorithms that use episodic training, such as meta-learning algorithms. We hope our presented approach and discussion will prove useful to other researchers wanting to apply privacy bounds to multistage sampling in other studies and applications.

References

[1] A. Narayanan, V. Shmatikov, Robust de-anonymization of large sparse datasets, in: 2008 IEEE Symposium on Security and Privacy (SP 2008), 2008, pp. 111–125. doi:10.1109/SP.2008.33.

[2] L. Rocher, J. M. Hendrickx, Y.-A. de Montjoye, Estimating the success of re-identifications in incomplete datasets using generative models, Nature Communications 10 (2019) 3069. URL: https://doi.org/10.1038/s41467-019-10933-3. doi:10.1038/s41467-019-10933-3.

[3] C. Dwork, A. Roth, The algorithmic foundations of differential privacy, Foundations and Trends in Theoretical Computer Science 9 (2014) 211–407. URL: https://doi.org/10.1561/0400000042. doi:10.1561/0400000042.

[4] C. Dwork, Differential privacy: A survey of results, in: M. Agrawal, D. Du, Z. Duan, A. Li (Eds.), Theory and Applications of Models of Computation, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008, pp. 1–19.

[5] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, L. Zhang, Deep learning with differential privacy, in: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 308–318. URL: https://doi.org/10.1145/2976749.2978318. doi:10.1145/2976749.2978318.

[6] Y.-X. Wang, B. Balle, S. Kasiviswanathan, Subsampled Rényi differential privacy and analytical moments accountant, Journal of Privacy and Confidentiality 10 (2021). URL: https://journalprivacyconfidentiality.org/index.php/jpc/article/view/723. doi:10.29012/jpc.723.

[7] B. Balle, G. Barthe, M. Gaboardi, Privacy amplification by subsampling: Tight analyses via couplings and divergences, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS '18, Curran Associates Inc., Red Hook, NY, USA, 2018, pp. 6280–6290.

[8] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in: D. Precup, Y. W. Teh (Eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 1126–1135. URL: https://proceedings.mlr.press/v70/finn17a.html.

[9] S. Ravi, H. Larochelle, Optimization as a model for few-shot learning, in: ICLR, 2017.

[10] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, D. Wierstra, Matching networks for one shot learning, in: D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 29, Curran Associates, Inc., 2016. URL: https://proceedings.neurips.cc/paper/2016/file/90e1357833654983612fb05e3ec9148c-Paper.pdf.

[11] J. Snell, K. Swersky, R. Zemel, Prototypical networks for few-shot learning, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/file/cb8da6767461f2812ae4290eac7cbc42-Paper.pdf.

[12] D. Li, J. Zhang, Y. Yang, C. Liu, Y.-Z. Song, T. Hospedales, Episodic training for domain generalization, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1446–1455. doi:10.1109/ICCV.2019.00153.

[13] C.-E. Särndal, B. Swensson, J. Wretman, Model Assisted Survey Sampling, Springer-Verlag, 2003, pp. 124–162.

[14] I. Mironov, K. Talwar, L. Zhang, Rényi differential privacy of the sampled Gaussian mechanism, 2019. URL: http://arxiv.org/abs/1908.10530. arXiv:1908.10530.