Privacy Amplification for Episodic Training Methods
Vandy Tombs, Olivera Kotevska and Steven Young
Oak Ridge National Laboratory, 1 Bethel Valley Road, Oak Ridge, TN 37830


Abstract

It has been shown that differential privacy bounds improve when subsampling within a randomized mechanism. Episodic training, utilized in many standard machine learning techniques, uses a multistage subsampling procedure which has not been previously analyzed for privacy bound amplification. In this paper, we focus on improving the calculation of privacy bounds in episodic training by thoroughly analyzing privacy amplification due to subsampling with a multistage subsampling procedure. The newly developed bound can be incorporated into existing privacy accounting methods.


This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The publisher acknowledges the US government license to provide public access under the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan). Research sponsored by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the U.S. Department of Energy.

1. Introduction

As more data is utilized by algorithms and machine learning techniques, rigorously maintaining the privacy of this data has become important. Cyber security, health, and census data collection are all examples of fields that are seeing increased scrutiny for ensuring the privacy of data, and it is well known that merely anonymizing data by removing features such as names is not sufficient to guarantee privacy, owing to vulnerabilities such as re-identification attacks, especially when an adversary has access to auxiliary knowledge or data (see e.g. [1, 2]).

Differential privacy, first introduced by Dwork, is one technical definition of privacy that has been studied widely in the literature [3, 4]. This definition provides rigorous guarantees for the privacy of data utilized by an algorithm and has several nice properties, such as robustness to post-processing and strong composition theorems.

Machine learning practitioners initially integrated differential privacy by naively applying these composition theorems, assuming that the algorithm accessed the entire training set on each step of training. Abadi et al. [5] noticed that the data is subsampled into batches, so only a subset of the data is utilized at each step of training. This allowed for improved privacy bounds; however, they assumed that batches were created using Poisson sampling. Later authors showed improved bounds for batches created using simple random sampling without replacement [6]. Most recently, Balle et al. [7] provided a fully unified theory for determining the privacy amplification due to subsampling, as well as a complete analysis for Poisson subsampling and for simple random sampling both with and without replacement.

The subsampling methods analyzed previously include many of the subsampling methods utilized in machine learning; however, they do not capture batches formed by algorithms that use episodic training. Episodic training methods are utilized by a variety of machine learning algorithms, such as meta-learning (e.g., [8, 9]) and metric learning (e.g., [10, 11]) algorithms. Domain generalization algorithms have also frequently utilized episodic training [12].

In this paper, we analyze the privacy amplification due to the subsampling method utilized in an episodic training regime. Specifically, we observe that forming batches in episodic training is a multistage subsampling method, and we provide a complete analysis of the improved differential privacy bounds obtained when applying a mechanism to a sample drawn using multistage subsampling. The resulting theorem can be easily applied to episodic training methods and integrated with privacy accounting methods such as the moments accountant [5]. The bound can also be utilized by practitioners in other domains that use multistage subsampling within their algorithms.

2. Background and Related Works

2.1. Multistage Subsampling

In a multistage sampling procedure, the universe from which samples are drawn is partitioned. These partitions may contain the examples we are ultimately interested in sampling, or they may contain one or several further levels of partitions. The subsampling procedure is to sample partitions at each level until examples are sampled. For example, if we are interested in the demographics of students at a school, we could partition the students by teacher, sample some number of teachers, and then sample students from each sampled teacher.
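To make the procedure concrete, here is a minimal Python sketch of this two-stage draw; the class lists and per-stage sample sizes are hypothetical, chosen only for illustration.

    import random

    # Hypothetical universe: students partitioned by teacher.
    classes = {
        "Adams": ["s01", "s02", "s03", "s04"],
        "Brooks": ["s05", "s06", "s07"],
        "Chen": ["s08", "s09", "s10", "s11", "s12"],
    }
    n_teachers, n_students = 2, 2  # assumed per-stage sample sizes

    # Stage 1: sample teachers (the first-level partitions) without replacement.
    teachers = random.sample(sorted(classes), n_teachers)
    # Stage 2: sample students from each sampled teacher's class.
    students = [s for t in teachers
                for s in random.sample(classes[t], n_students)]
    print(teachers, students)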
To see that episodic training is a multistage subsampling procedure, consider how training batches are formed in Algorithm 2 of [8]. In this work, a subset of tasks is sampled from a collection of tasks, and then examples are sampled and provided to the training algorithm. This is a 2-stage sampling procedure, since the training data is only partitioned into two levels: tasks and examples. In multistage subsampling, the first level of partitions are the primary sampling units, and the final level is called the ultimate sampling units; this final level contains the examples we are ultimately interested in sampling. For more details on multistage subsampling, see e.g. [13].

2.2. Differential Privacy

Since our analysis utilizes the tools of Balle et al. [7], we introduce the necessary notation and definitions from it. Let 𝒰 be an input space equipped with a binary symmetric relation ≃_𝒰 that describes the concept of neighboring inputs. For our purposes, 𝒰 is a universe that the training data is drawn from, and the relation will be the add-one/remove-one relation; thus two training sets are related if they differ by the addition or removal of one element.

Given a randomized algorithm or mechanism β„³ : 𝑋 β†’ P(𝑍), where P(𝑍) is the set of probability measures on the output space 𝑍, β„³ is (πœ€, 𝛿)-differentially private w.r.t. ≃_𝒰 if for every pair 𝒯 ≃_𝒰 𝒯′ and every measurable subset 𝐸 βŠ† 𝑍,

    Pr[β„³(𝒯) ∈ 𝐸] ≀ e^πœ€ Pr[β„³(𝒯′) ∈ 𝐸] + 𝛿.

Utilizing the tools from [7] requires expressing differential privacy in terms of the 𝛼-divergence

    D_𝛼(πœ‡ || πœ‡β€²) := sup_𝐸 (πœ‡(𝐸) βˆ’ 𝛼 πœ‡β€²(𝐸))

of two probability measures πœ‡, πœ‡β€² ∈ P(𝑍), where 𝐸 ranges over all measurable subsets of 𝑍. Differential privacy can then be stated in terms of the 𝛼-divergence; specifically, a mechanism β„³ is (πœ€, 𝛿)-differentially private if and only if D_{e^πœ€}(β„³(𝒯) || β„³(𝒯′)) ≀ 𝛿 for every pair of adjacent datasets 𝒯 ≃_𝒰 𝒯′.

We can now define the privacy profile of a mechanism β„³ as 𝛿_β„³(πœ€) = sup_{𝒯 ≃_𝒰 𝒯′} D_{e^πœ€}(β„³(𝒯) || β„³(𝒯′)), which associates with each privacy parameter 𝛼 = e^πœ€ a bound on the 𝛼-divergence between the results of the mechanism on two adjacent datasets.
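On a finite output space the supremum in the definition of D_𝛼 is attained at the event 𝐸 = {𝑧 : πœ‡(𝑧) > π›Όπœ‡β€²(𝑧)}, so the divergence reduces to a sum of clipped differences. A minimal Python sketch (the two distributions below are hypothetical):

    import math

    def alpha_divergence(mu, mu_prime, alpha):
        # D_alpha(mu || mu') = sup_E mu(E) - alpha * mu'(E); on a finite space
        # the supremum is attained at E = {z : mu(z) > alpha * mu'(z)}.
        support = set(mu) | set(mu_prime)
        return sum(max(0.0, mu.get(z, 0.0) - alpha * mu_prime.get(z, 0.0))
                   for z in support)

    # Hypothetical output distributions of a mechanism on adjacent datasets.
    mu, mu_prime = {0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}
    eps = 0.1
    print(alpha_divergence(mu, mu_prime, math.exp(eps)))  # ~0.347

A mechanism is (πœ€, 𝛿)-differentially private exactly when this quantity is at most 𝛿 for every adjacent pair, which is what the privacy profile records.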
Two theorems from [7] are important in our analysis. The first is advanced joint convexity, which we restate in terms of 𝛼 = e^πœ€ since we are interested in applying this theorem to improve the privacy bounds due to multistage subsampling.

Theorem 1 ([7], Advanced Joint Convexity of D_{e^πœ€}). Let πœ‡, πœ‡β€² ∈ P(𝑍) be measures satisfying πœ‡ = (1 βˆ’ πœ‚)πœ‡_0 + πœ‚πœ‡_1 and πœ‡β€² = (1 βˆ’ πœ‚)πœ‡_0 + πœ‚πœ‡β€²_1 for some πœ‚, πœ‡_0, πœ‡_1, πœ‡β€²_1. Given πœ€ > 0, let πœ€β€² = log(1 + πœ‚(e^πœ€ βˆ’ 1)) and 𝛽 = e^{πœ€β€²}/e^πœ€. Then the following holds:

    D_{e^{πœ€β€²}}(πœ‡ || πœ‡β€²) = πœ‚ D_{e^πœ€}(πœ‡_1 || (1 βˆ’ 𝛽)πœ‡_0 + π›½πœ‡β€²_1)

The final theorem provides the concrete privacy amplification that we need for our analysis. Before presenting it, we need to define when two distributions 𝜐, 𝜐′ ∈ P(π‘Œ) are d_π‘Œ-compatible. Let πœ‹ be a coupling of 𝜐 and 𝜐′, and define d_π‘Œ(𝑦, 𝑦′) = d_π‘Œ(𝑦, supp(𝜐′)) for (𝑦, 𝑦′) ∈ supp(πœ‹), where the distance between a point 𝑦 and supp(𝜐′) is defined to be the distance between 𝑦 and the closest point in supp(𝜐′).

Theorem 2. Let 𝐢(𝜐, 𝜐′) be the set of all couplings between 𝜐 and 𝜐′, and for π‘˜ β‰₯ 1 let π‘Œ_π‘˜ = {𝑦 ∈ supp(𝜐) : d_π‘Œ(𝑦, supp(𝜐′)) = π‘˜}. If 𝜐 and 𝜐′ are d_π‘Œ-compatible, then the following holds:

    min_{πœ‹ ∈ 𝐢(𝜐,𝜐′)} Ξ£_{𝑦,𝑦′} πœ‹_{𝑦,𝑦′} 𝛿_{β„³,d_π‘Œ(𝑦,𝑦′)}(πœ€) = Ξ£_{π‘˜β‰₯1} 𝜐(π‘Œ_π‘˜) 𝛿_{β„³,π‘˜}(πœ€)

We are now equipped to begin an analysis of the privacy amplification due to multistage subsampling.

3. Our Approach: Privacy Bounds for Multistage Sampling

We will begin the analysis with an example, through which we will introduce the notation necessary for the general analysis.

Example 3.1. Let 𝒰 be a universe of 18 examples from which the database or training data is drawn. Suppose we can categorize the data from the universe at 3 different levels, so we will perform a 3-stage sampling. Let

    𝒰 = π‘ˆ_1 βˆͺ π‘ˆ_2
      = (π‘ˆ_11 βˆͺ π‘ˆ_12 βˆͺ π‘ˆ_13) βˆͺ (π‘ˆ_21 βˆͺ π‘ˆ_22)
      = ({𝑒_111, 𝑒_112, 𝑒_113, 𝑒_114} βˆͺ {𝑒_121, 𝑒_122} βˆͺ {𝑒_131, 𝑒_132, 𝑒_133})
        βˆͺ ({𝑒_211, 𝑒_212, 𝑒_213, 𝑒_214} βˆͺ {𝑒_221, 𝑒_222, 𝑒_223, 𝑒_224, 𝑒_225})

In this example, the π‘ˆ_{𝑖_1} for 𝑖_1 ∈ {1, 2} are the primary sampling units, the π‘ˆ_{𝑖_1 𝑖_2} are the ultimate sampling units, and the 𝑒_{𝑖_1 𝑖_2 𝑖_3} are the examples that would be provided to a training algorithm.
In general, let 𝒰 be a universe from which the training data is drawn, and suppose a finite number of levels, 𝑁_𝐿, partition this universe. Define the π‘ˆ_{𝑖_1} to be the primary sampling units and let π‘ˆ_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1}} be the sampling units of the π‘ˆ_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’2}} unit. π‘ˆ_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1}} is an ultimate sampling unit, which contains the examples we are interested in sampling. Note that we require each sampling unit to be of finite size except the ultimate sampling units, which may be infinite. The multistage sampling procedure is described by Algorithm 1. Most episodic training procedures use only 2- or 3-stage sampling, but we analyze the general case, which may have applications in other scientific domains (e.g., medical domains) where multistage sampling may have more levels.

Algorithm 1: Multistage Sampling

    Given 𝑛_𝑗, the number of units to be sampled at each level (1 ≀ 𝑗 ≀ 𝑁_𝐿)
    Set PrevLevel := {{π‘ˆ_𝑖}}, a single unit containing all primary sampling
      units, so that the first stage samples 𝑛_1 primary units
    Set CurrentLevel := βˆ…
    for 𝑗 ∈ {1, ..., 𝑁_𝐿} do
        for 𝑆 ∈ PrevLevel do
            sample 𝑛_𝑗 elements without replacement from 𝑆
            add the sampled elements to CurrentLevel
        end
        PrevLevel := CurrentLevel
        CurrentLevel := βˆ…
    end
Now, let 𝒯 βŠ‚ 𝒰 be the training data or database we are analyzing. We will require that the training data have at least one element from each sampling unit described above. Thus we only allow the ultimate sampling units of the training data, 𝑇_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1}} βŠ‚ π‘ˆ_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1}}, to be non-empty finite subsets of the ultimate sampling units with at least 𝑛_{𝑁_𝐿} elements (i.e., at least the number of units that will be sampled from the ultimate sampling units). All other sampling units defined for the universe remain the same for the training set.

We want to analyze the privacy bound of algorithms that use a multistage subsampling procedure on 𝒯. To do this, we will apply the theorems from [7] and analyze this sampling procedure under the add-one/remove-one relation. We begin by defining a probability measure for this sampling procedure, which we can do by simply defining

    πœ‡(𝑑_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_𝐿}}) = (∏_{𝑗=1}^{𝑁_𝐿} 𝑛_𝑗) / (|π‘ˆ_{𝑖_1}| |π‘ˆ_{𝑖_1 𝑖_2}| Β· Β· Β· |𝑇_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1}}|)

where 𝑑_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_𝐿}} is in the ultimate unit 𝑇_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1}}.
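A direct transcription of this display (Python; the sizes of the units along an example's path are taken as given, ending with the size of its ultimate training unit):

    import math

    def mu(n, path_sizes):
        # prod_j n_j divided by |U_{i1}| |U_{i1 i2}| ... |T_{i1 ... i_{NL-1}}|,
        # exactly as in the display above.
        return math.prod(n) / math.prod(path_sizes)

    # In Example 3.1 with n = (1, 2, 2): an example in T_11, where U_1 holds
    # |U_1| = 3 ultimate units and |T_11| = 4 elements, gets weight 4/12.
    print(mu((1, 2, 2), (3, 4)))  # 0.333...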
Now consider 𝒯′ created by removing one element from 𝒯, say without loss of generality 𝑑_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1} 1} for some 𝑖_1, 𝑖_2, ..., 𝑖_{𝑁_πΏβˆ’1}. The probability measure πœ‡β€² for sampling from 𝒯′ can be defined similarly to the above. We wish to compute the total variational distance between these two measures so that we can apply the Advanced Coupling Theorem from [7]. We just need to compute

    TV(πœ‡, πœ‡β€²) = 1 βˆ’ Ξ£_{𝑒 ∈ 𝒰} min(πœ‡(𝑒), πœ‡β€²(𝑒)).

Note that we can easily extend our probability measures πœ‡, πœ‡β€² to the entire universe by setting the inclusion probability to 0 for any element not in 𝒯 or 𝒯′. For all elements 𝑑 ∈ 𝒯′ βˆ– 𝑇_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1}}, we have min(πœ‡(𝑑), πœ‡β€²(𝑑)) = πœ‡(𝑑) = πœ‡β€²(𝑑). Since 𝑑_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1} 1} βˆ‰ 𝒯′, we also have min(πœ‡(𝑑_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1} 1}), πœ‡β€²(𝑑_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1} 1})) = 0. So we just need to consider the elements of the ultimate unit from which we removed an element. Since we removed an element from this unit, we have πœ‡β€²(𝑑) > πœ‡(𝑑) on it, because 𝑇′_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1}} (the ultimate unit missing an element in 𝒯′) has fewer elements than 𝑇_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1}}; therefore, for all 𝑑_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1} 𝑖} ∈ 𝑇′_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1}} with 𝑖 β‰  1, we have πœ‡(𝑑_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1} 𝑖}) < πœ‡β€²(𝑑_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1} 𝑖}), where

    πœ‡(𝑑_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1} 𝑖}) = (∏_{𝑗=1}^{𝑁_𝐿} 𝑛_𝑗) / (|π‘ˆ_{𝑖_1}| |π‘ˆ_{𝑖_1 𝑖_2}| Β· Β· Β· |𝑇_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1}}|)
    πœ‡β€²(𝑑_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1} 𝑖}) = (∏_{𝑗=1}^{𝑁_𝐿} 𝑛_𝑗) / (|π‘ˆ_{𝑖_1}| |π‘ˆ_{𝑖_1 𝑖_2}| Β· Β· Β· |𝑇′_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1}}|).

Thus

    Ξ£_{𝑒 ∈ 𝒰} min(πœ‡(𝑒), πœ‡β€²(𝑒)) = Ξ£_{𝑑 ∈ 𝒯′} πœ‡(𝑑) = 1 βˆ’ πœ‡(𝑑_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1} 1}).

Hence the total variational distance is just the inclusion probability of the element we removed. Determining the total variational distance when adding an element of 𝒰 to 𝒯 is similar to the above argument.

We can now provide an amplified privacy bound for multistage subsampling.

Theorem 3. Let β„³β€² be the mechanism obtained by applying a base mechanism β„³ to a subsample of 𝒯 drawn by Algorithm 1, and let π‘š_1 π‘š_2 Β·Β·Β· π‘š_{𝑁_πΏβˆ’1} be the index of the penultimate sampling unit that attains

    min_{𝑖_1, 𝑖_2, ..., 𝑖_{𝑁_πΏβˆ’1}} (|π‘ˆ_{𝑖_1}| |π‘ˆ_{𝑖_1 𝑖_2}| Β· Β· Β· |𝑇_{𝑖_1 𝑖_2 Β·Β·Β· 𝑖_{𝑁_πΏβˆ’1}}|).

Then, for any πœ– β‰₯ 0, we have 𝛿_{β„³β€²}(πœ–β€²) ≀ πœ‚ 𝛿_{β„³}(πœ–) for πœ‚ = (∏_{𝑗=1}^{𝑁_𝐿} 𝑛_𝑗) / (|π‘ˆ_{π‘š_1}| |π‘ˆ_{π‘š_1 π‘š_2}| Β· Β· Β· |𝑇_{π‘š_1 π‘š_2 Β·Β·Β· π‘š_{𝑁_πΏβˆ’1}}|) and πœ–β€² = log(1 + πœ‚(e^πœ– βˆ’ 1)), under the add-one/remove-one relation.
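Operationally, πœ‚ is the inclusion-probability display above evaluated at the minimizing penultimate unit, and the amplified parameters follow directly. A minimal sketch (Python; the base (πœ–, 𝛿) is hypothetical, and the unit sizes are those of Example 3.1):

    import math

    def amplified(eps, delta, n, min_path_sizes):
        # Theorem 3: eta = prod(n) / (product of the minimizing unit sizes);
        # the subsampled mechanism is (eps', eta * delta)-DP whenever the
        # base mechanism is (eps, delta)-DP.
        eta = math.prod(n) / math.prod(min_path_sizes)
        eps_prime = math.log1p(eta * (math.exp(eps) - 1.0))
        return eps_prime, eta * delta

    # Example 3.1 with n = (1, 2, 2): the minimizing penultimate unit is T_12
    # (|U_1| = 3 sub-units, |T_12| = 2 elements), so eta = 4/6.
    print(amplified(1.0, 1e-5, (1, 2, 2), (3, 2)))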
To complete the proof, let 𝒯, 𝒯′ be training sets drawn from 𝒰 with 𝒯 ≃_π‘Ÿ 𝒯′ under the add-one/remove-one relation ≃_π‘Ÿ, and let 𝑆_𝛾(𝒯) denote the subsampling mechanism described by Algorithm 1, where 𝛾 = TV(πœ‡, πœ‡β€²). Let 𝒯_0 = 𝒯 ∩ 𝒯′; then by definition of ≃_π‘Ÿ, 𝒯_0 = 𝒯 or 𝒯_0 = 𝒯′. Let πœ”_0 = 𝑆_𝛾(𝒯_0), πœ‡ = 𝑆_𝛾(𝒯), and πœ‡β€² = 𝑆_𝛾(𝒯′). Then the decompositions of πœ‡ and πœ‡β€² induced by their maximal coupling have πœ‡_1 = πœ”_0 when 𝒯_0 = 𝒯, or πœ‡β€²_1 = πœ”_0 when 𝒯_0 = 𝒯′. We only need to consider 𝒯_0 = 𝒯′, since this is when the maximum is obtained in applying advanced joint convexity. Finally, we note that one can easily create a d_{≃_π‘Ÿ}-compatible pair according to the definition provided in [7] by first sampling 𝑦 from πœ‡ and building 𝑦′ by adding 𝑣 (which may be empty) to 𝑦. Thus for each dataset pair, by Theorem 7 of [7], we have 𝛿_{β„³β€²}(πœ€β€²) ≀ 𝛾 𝛿_{β„³}(πœ€). To obtain a bound valid for all possible training set pairs, we take πœ‚ to be the largest such 𝛾 over pairs 𝒯 ≃_π‘Ÿ 𝒯′; this occurs exactly when we remove an element from the penultimate unit with index π‘š_1 π‘š_2 Β·Β·Β· π‘š_{𝑁_πΏβˆ’1}, which completes the proof.

We briefly mention how one might incorporate this new bound into a privacy accounting method. Many accounting methods, like the moments accountant [5], use the moment generating function in conjunction with the Gaussian mechanism to calculate the privacy bounds while a machine learning algorithm is training. Using Theorem 4 from [7] with our new bound, one can easily derive a subsampled Gaussian mechanism that can be utilized in algorithms like those described in [5, 14].
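To give the flavor of such accounting without reproducing the moments accountant itself, the sketch below composes the per-step amplified bound using basic composition and the advanced composition theorem of [3]; this is a crude, illustrative stand-in for the accountant of [5], and all numbers are hypothetical.

    import math

    def amplified_eps(eps, eta):
        # Per-step epsilon after Theorem 3's amplification.
        return math.log1p(eta * (math.exp(eps) - 1.0))

    def compose(eps_step, delta_step, steps, delta_slack=1e-6):
        # Basic composition and advanced composition ([3], Thm. 3.20) of
        # `steps` identical (eps_step, delta_step)-DP training steps.
        basic = (steps * eps_step, steps * delta_step)
        advanced = (math.sqrt(2 * steps * math.log(1 / delta_slack)) * eps_step
                    + steps * eps_step * (math.exp(eps_step) - 1.0),
                    steps * delta_step + delta_slack)
        return basic, advanced

    # Hypothetical per-step mechanism (0.5, 1e-7)-DP, amplified with eta = 1/50.
    eta = 1 / 50
    eps1, delta1 = amplified_eps(0.5, eta), eta * 1e-7
    print(compose(eps1, delta1, steps=1000))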
4. Conclusion

This paper completely analyzes the privacy amplification due to multistage subsampling. This provides the correct privacy bounds for any algorithm that utilizes multistage subsampling, such as machine learning algorithms that use episodic training. Our future goal is to perform experiments to better understand privacy in machine learning algorithms that use episodic training, such as meta-learning algorithms. We hope our presented approach and discussion will prove useful to other researchers wanting to apply privacy bounds to multistage sampling in other studies and applications.

References

[1] A. Narayanan, V. Shmatikov, Robust de-anonymization of large sparse datasets, in: 2008 IEEE Symposium on Security and Privacy (SP 2008), 2008, pp. 111–125. doi:10.1109/SP.2008.33.
[2] L. Rocher, J. M. Hendrickx, Y.-A. de Montjoye, Estimating the success of re-identifications in incomplete datasets using generative models, Nature Communications 10 (2019) 3069. URL: https://doi.org/10.1038/s41467-019-10933-3. doi:10.1038/s41467-019-10933-3.
[3] C. Dwork, A. Roth, The algorithmic foundations of differential privacy, Found. Trends Theor. Comput. Sci. 9 (2014) 211–407. URL: https://doi.org/10.1561/0400000042. doi:10.1561/0400000042.
[4] C. Dwork, Differential privacy: A survey of results, in: M. Agrawal, D. Du, Z. Duan, A. Li (Eds.), Theory and Applications of Models of Computation, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008, pp. 1–19.
[5] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, L. Zhang, Deep learning with differential privacy, in: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 308–318. URL: https://doi.org/10.1145/2976749.2978318. doi:10.1145/2976749.2978318.
[6] Y.-X. Wang, B. Balle, S. Kasiviswanathan, Subsampled RΓ©nyi differential privacy and analytical moments accountant, Journal of Privacy and Confidentiality 10 (2021). URL: https://journalprivacyconfidentiality.org/index.php/jpc/article/view/723. doi:10.29012/jpc.723.
[7] B. Balle, G. Barthe, M. Gaboardi, Privacy amplification by subsampling: Tight analyses via couplings and divergences, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, Curran Associates Inc., Red Hook, NY, USA, 2018, pp. 6280–6290.
[8] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in: D. Precup, Y. W. Teh (Eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 1126–1135. URL: https://proceedings.mlr.press/v70/finn17a.html.
[9] S. Ravi, H. Larochelle, Optimization as a model for few-shot learning, in: ICLR, 2017.
[10] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, D. Wierstra, Matching networks for one shot learning, in: D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 29, Curran Associates, Inc., 2016. URL: https://proceedings.neurips.cc/paper/2016/file/90e1357833654983612fb05e3ec9148c-Paper.pdf.
[11] J. Snell, K. Swersky, R. Zemel, Prototypical networks for few-shot learning, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/file/cb8da6767461f2812ae4290eac7cbc42-Paper.pdf.
[12] D. Li, J. Zhang, Y. Yang, C. Liu, Y.-Z. Song, T. Hospedales, Episodic training for domain generalization, in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 1446–1455. doi:10.1109/ICCV.2019.00153.
[13] C.-E. SΓ€rndal, B. Swensson, J. Wretman, Model Assisted Survey Sampling, Springer-Verlag, 2003, pp. 124–162.
[14] I. Mironov, K. Talwar, L. Zhang, RΓ©nyi differential privacy of the sampled Gaussian mechanism, 2019. URL: http://arxiv.org/abs/1908.10530. arXiv:1908.10530.