<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Training Differentially Private Ad Prediction Models With Semi-Sensitive Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lynn Chua</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qiliang Cui</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Badih Ghazi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Charlie Harrison</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pritish Kamath</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Walid Krichene</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ravi Kumar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasin Manurangsi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Mayoraz</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krishna Giri Narra</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steffen Rendle</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amer Sinha</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Avinash Varadarajan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiyuan Zhang</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Motivated by problems arising in digital advertising, we study the task of training differentially private (DP) machine learning models with semi-sensitive features. In this setting, a subset of the features is known to the attacker (and thus need not be protected) while the remaining features as well as the label are unknown to the attacker and should be protected by the DP guarantee. This task interpolates between training the model with full DP (where the label and all features should be protected) or with label DP (where all the features are considered known, and only the label should be protected). We present a new algorithm for training DP models with semi-sensitive features. Through an empirical evaluation on ads datasets, we demonstrate that our algorithm surpasses in utility the baselines of (i) DP stochastic gradient descent (DP-SGD) run on all features (known and unknown), and (ii) a label DP algorithm run only on the known features (while discarding the unknown ones).</p>
      </abstract>
      <kwd-group>
        <kwd>Differential privacy</kwd>
        <kwd>model training</kwd>
        <kwd>ad models</kwd>
        <kwd>semi-sensitive features</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, large-scale machine learning (ML)
algorithms have been adopted and deployed for different ad
modeling tasks, including the training of predicted click-through
rates (a.k.a. pCTR) and predicted conversion rates (a.k.a.
pCVR) models. Roughly speaking, pCTR models predict the
likelihood that an ad shown to a user is clicked, and pCVR
models predict the likelihood that an ad clicked (or viewed)
by the user leads to a conversion—which is defined as a
desirable action by the user on the advertiser site or app, such
as the purchase of the advertised product.</p>
      <p>
        Heightened user expectations around privacy have led
different web browsers (including Apple Safari [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Mozilla
Firefox [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and Google Chrome [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) to deprecate
third-party cookies (3PC), the cross-site identifiers that
had until now allowed the datasets on which pCTR and pCVR
models are trained to be joined in the clear. More
precisely, 3PCs made it possible to determine the
conversion label for pCVR models and to construct
features (for pCTR and pCVR models) that depend on the
user’s behavior on sites other than the publisher where the
ad was shown.
      </p>
      <p>
        In order to support essential web functionalities that are
affected by the deprecation of 3PCs, different web browsers
have been building privacy-preserving APIs, including for
ads measurement and modeling such as the Interoperable
Private Attribution (IPA) developed by Mozilla and Meta [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
Masked LARK from Microsoft [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ], the Attribution
Reporting API, available on both the Chrome browser [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and the
Android operating system [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and the Private Click
Measurement (PCM) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Private Ad Measurement (PAM) APIs
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] from Apple. The privacy guarantees of several of these
APIs rely on differential privacy (DP) [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ], which is a
strong and robust notion of privacy that has in recent years
gained significant popularity for data analytics and modeling
tasks.
      </p>
      <p>
        Different DP variants have been studied in the context of
supervised ML, depending on the adjacency definition. The
standard definition of DP protects the full training example
(features and label) and has been extensively studied, e.g.,
Abadi et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. On the other hand, Label DP (e.g.,
Chaudhuri and Hsu [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], Ghazi et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], Malek Esmaeili et al.
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) is a variant that only protects the label of each training
example, and is thus suitable in settings where the adversary
already has access to the features.
      </p>
      <p>
        Label DP is a natural fit for the case where the features of
the pCVR problem do not depend on cross-site information.
However, a common setting, including that of the Protected
Audience API on Chrome [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and Android [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], is where
some features depend on cross-site information whereas the
remaining features do not. An example is the remarketing
use case where a feature could indicate whether the same
user previously expressed interest in the advertised product
(e.g., added it to their cart) but did not purchase it. Revealing
a row of the database that has both contextual features (e.g.,
the publisher site, or the time of day the ad was served)
and features derived on the advertiser (e.g., user presence
on a particular remarketing list) could allow an attacker to
track a user across sites. In the Protected Audience API,
these sensitive features are protected by multiple privacy
mechanisms including feature-level randomized response.
      </p>
      <p>
        The focus of this work is to analyze this setting from the
DP perspective; we refer to it as DP model training with
semi-sensitive features. We formalize this setting, present an
algorithm for training private ML models with semi-sensitive
features, and evaluate it on real ad prediction datasets,
showing that it compares favorably to natural baselines.<sup>1</sup> We
report the effect of certain important parameters on utility,
and also study the trade-offs between the size of the private
model and its quality; this is motivated by practical
settings in which the private ML model training may happen in
Trusted Execution Environments with limited memory.
      </p>
      <p><sup>1</sup> A preliminary version of this paper was presented at PPAI-24: The 5th
AAAI Workshop on Privacy-Preserving Artificial Intelligence.</p>
      <p>AdKDD ’24, August 2024, Barcelona, Spain.
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Two related notions of private model training with partially
private features were recently proposed, although they differ
slightly in their adjacency definitions (and hence in what is
considered public information): Krichene et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] propose
a stronger notion in which the adversary is only assumed to
know the set of distinct values that the non-sensitive features
may take, for example the feature values of all possible ads,
(while in our notion, we assume the adversary knows which
specific values appeared in the dataset, together with their
counts). And in a concurrent work [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], an algorithm based
on AdaBoost was proposed under a privacy notion similar to
ours; a key difference is that in their setting, the labels are
considered public.
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. DP Training with Semi-Sensitive Features</title>
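      <p>We consider the setting of supervised learning, where we
assume an underlying (unknown) distribution <inline-formula><tex-math>\mathcal{D}</tex-math></inline-formula> over <inline-formula><tex-math>\mathcal{X} \times \mathcal{Y}</tex-math></inline-formula>,
where <inline-formula><tex-math>\mathcal{X}</tex-math></inline-formula> denotes the set of possible inputs and <inline-formula><tex-math>\mathcal{Y}</tex-math></inline-formula> denotes
the set of possible labels. In this work, we focus on the
binary classification setting where <inline-formula><tex-math>\mathcal{Y} = \{0, 1\}</tex-math></inline-formula>. Our goal
is to learn a predictor <inline-formula><tex-math>f_\theta : \mathcal{X} \to \mathbb{R}</tex-math></inline-formula> that minimizes the expected loss
<inline-formula><tex-math>\mathcal{L}(f_\theta; \mathcal{D}) := \mathbb{E}_{(x,y) \sim \mathcal{D}}\, \ell(f_\theta(x), y)</tex-math></inline-formula>, where <inline-formula><tex-math>\ell(\cdot, \cdot)</tex-math></inline-formula> is a suitable
loss function, e.g., the binary cross-entropy loss.</p>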
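      <p>To capture the setting of semi-sensitive features, let <inline-formula><tex-math>\mathcal{X} = \mathcal{X}^\circ \times \mathcal{X}^\bullet</tex-math></inline-formula>,
where <inline-formula><tex-math>\mathcal{X}^\circ</tex-math></inline-formula> is the set of possible nonsensitive feature
values, and <inline-formula><tex-math>\mathcal{X}^\bullet</tex-math></inline-formula> is the set of possible sensitive feature values.
We denote a dataset as <inline-formula><tex-math>D = ((x_i^\circ, x_i^\bullet, y_i))_{i \in [n]}</tex-math></inline-formula>, where <inline-formula><tex-math>x_i^\circ</tex-math></inline-formula>
denotes the nonsensitive feature value, <inline-formula><tex-math>x_i^\bullet</tex-math></inline-formula> is the sensitive
feature value, and <inline-formula><tex-math>y_i</tex-math></inline-formula> is the corresponding (sensitive) label.
We use <inline-formula><tex-math>x_i</tex-math></inline-formula> to denote <inline-formula><tex-math>(x_i^\circ, x_i^\bullet)</tex-math></inline-formula> for short. This setting is
motivated by ads modeling tasks, where
the features can include nonsensitive features such as the
browser class, publisher website, or category of the mobile app;
sensitive features such as how long ago and how many
times a user showed interest in an advertised product;
and sensitive labels such as whether the user converted on
the ad.</p>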
      <p>
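        We say that two datasets <inline-formula><tex-math>D, D'</tex-math></inline-formula> are adjacent, denoted
<inline-formula><tex-math>D \sim D'</tex-math></inline-formula>, if one dataset can be obtained from the other by
changing the sensitive features and/or the label for a single
example, namely replacing <inline-formula><tex-math>(x_i^\circ, x_i^\bullet, y_i)</tex-math></inline-formula> with <inline-formula><tex-math>(x_i^\circ, \tilde{x}_i^\bullet, \tilde{y}_i)</tex-math></inline-formula> for
some <inline-formula><tex-math>(\tilde{x}_i^\bullet, \tilde{y}_i) \in \mathcal{X}^\bullet \times \mathcal{Y}</tex-math></inline-formula>. Note in particular that the <inline-formula><tex-math>x_i^\circ</tex-math></inline-formula> are
not allowed to change in the adjacent dataset, and should be
considered known to the adversary.<sup>2</sup>
      </p>
      <p>
        Definition 1 (DP; Dwork et al. [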
        <xref ref-type="bibr" rid="ref12">12</xref>
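        ]). For <inline-formula><tex-math>\varepsilon, \delta \ge 0</tex-math></inline-formula>, a
randomized mechanism <inline-formula><tex-math>\mathcal{M}</tex-math></inline-formula> satisfies <inline-formula><tex-math>(\varepsilon, \delta)</tex-math></inline-formula>-DP if for all pairs
<inline-formula><tex-math>D, D'</tex-math></inline-formula> of adjacent datasets, and for all outcome events <inline-formula><tex-math>E</tex-math></inline-formula>, it
holds that <inline-formula><tex-math>\Pr[\mathcal{M}(D) \in E] \le e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in E] + \delta</tex-math></inline-formula>.
      </p>
      <p>
        For an extensive overview of DP, we refer the reader to the
monograph of Dwork and Roth [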
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. We use the following
key properties of DP.
      </p>
      <p>
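        Proposition 2 (Composition). If <inline-formula><tex-math>\mathcal{M}_1</tex-math></inline-formula> satisfies <inline-formula><tex-math>(\varepsilon_1, \delta_1)</tex-math></inline-formula>-DP,
and <inline-formula><tex-math>\mathcal{M}_2</tex-math></inline-formula> satisfies <inline-formula><tex-math>(\varepsilon_2, \delta_2)</tex-math></inline-formula>-DP, then the mechanism
<inline-formula><tex-math>\mathcal{M}</tex-math></inline-formula> that on dataset <inline-formula><tex-math>D</tex-math></inline-formula> returns <inline-formula><tex-math>(\mathcal{M}_1(D), \mathcal{M}_2(D))</tex-math></inline-formula> satisfies
<inline-formula><tex-math>(\varepsilon_1 + \varepsilon_2, \delta_1 + \delta_2)</tex-math></inline-formula>-DP. Furthermore, this holds even in the
adaptive case, when <inline-formula><tex-math>\mathcal{M}_2</tex-math></inline-formula> can use the output of <inline-formula><tex-math>\mathcal{M}_1</tex-math></inline-formula>.
      </p>
      <p>
        Proposition 3 (Post-Processing). If <inline-formula><tex-math>\mathcal{M}</tex-math></inline-formula> satisfies <inline-formula><tex-math>(\varepsilon, \delta)</tex-math></inline-formula>-DP,
then for all (randomized) algorithms <inline-formula><tex-math>\mathcal{A}</tex-math></inline-formula>, it holds that
<inline-formula><tex-math>\mathcal{A}(\mathcal{M}(\cdot))</tex-math></inline-formula> satisfies <inline-formula><tex-math>(\varepsilon, \delta)</tex-math></inline-formula>-DP.
      </p>
      <p>
        <sup>2</sup> Contrast this with the notion of DP with public features of [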
        <xref ref-type="bibr" rid="ref21">21</xref>
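        ], in which
<inline-formula><tex-math>x_i^\circ</tex-math></inline-formula> is allowed to change, as long as it takes values in the publicly known
<inline-formula><tex-math>\mathcal{X}^\circ</tex-math></inline-formula>.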
      </p>
      <p>Randomized Response. Perhaps the simplest
mechanism that satisfies DP, even predating its definition, is
Randomized Response. We state the mechanism in our context as
releasing the known features along with the corresponding
randomized (binary) label.</p>
      <p>
        Definition 4 (Randomized Response; Warner [
        <xref ref-type="bibr" rid="ref23">23</xref>
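        ]). For <inline-formula><tex-math>\varepsilon &gt; 0</tex-math></inline-formula>,
the mechanism <inline-formula><tex-math>\mathrm{RR}_\varepsilon</tex-math></inline-formula> on dataset <inline-formula><tex-math>D = ((x_i^\circ, x_i^\bullet, y_i))_{i \in [n]}</tex-math></inline-formula>
returns <inline-formula><tex-math>((x_i^\circ, \hat{y}_i))_{i \in [n]}</tex-math></inline-formula>, where each <inline-formula><tex-math>\hat{y}_i</tex-math></inline-formula> is set to <inline-formula><tex-math>y_i</tex-math></inline-formula> with probability
<inline-formula><tex-math>\frac{e^{\varepsilon}}{1 + e^{\varepsilon}}</tex-math></inline-formula>
and to <inline-formula><tex-math>1 - y_i</tex-math></inline-formula> with probability <inline-formula><tex-math>\frac{1}{1 + e^{\varepsilon}}</tex-math></inline-formula>.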
      </p>
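      <p>Proposition 5. <inline-formula><tex-math>\mathrm{RR}_\varepsilon</tex-math></inline-formula> satisfies <inline-formula><tex-math>(\varepsilon, 0)</tex-math></inline-formula>-DP.</p>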
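      <p>As an illustration of Definition 4, the following is a minimal NumPy sketch of Randomized Response applied to binary labels. The function name, the synthetic labels, and the seed are assumptions made for this example and do not come from the paper; the nonsensitive features would be released unchanged alongside the randomized labels.</p>
      <preformat><![CDATA[
```python
import numpy as np


def randomized_response(labels, epsilon, rng):
    """Keep each binary label with probability e^eps / (1 + e^eps), else flip it."""
    labels = np.asarray(labels)
    keep_prob = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    keep = rng.random(labels.shape) < keep_prob
    return np.where(keep, labels, 1 - labels)


rng = np.random.default_rng(0)            # fixed seed, for reproducibility only
y = rng.integers(0, 2, size=100_000)      # synthetic labels, not real data
y_hat = randomized_response(y, epsilon=1.0, rng=rng)
agreement = float(np.mean(y == y_hat))    # empirical fraction of labels kept
```
]]></preformat>
      <p>With a privacy parameter of 1, the fraction of unchanged labels should be close to e / (1 + e), i.e. roughly 0.73.</p>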
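      <p>SGD and DP-SGD. Let <inline-formula><tex-math>f_\theta</tex-math></inline-formula> be a parameterized model
(e.g., a neural network) with trainable weights <inline-formula><tex-math>\theta</tex-math></inline-formula>, and let
<inline-formula><tex-math>\{(x_1, y_1), \ldots, (x_B, y_B)\}</tex-math></inline-formula> be a random mini-batch of
training examples. Let <inline-formula><tex-math>\ell_i = \ell(f_\theta(x_i), y_i)</tex-math></inline-formula> be the loss on the
<inline-formula><tex-math>i</tex-math></inline-formula>th example and let the average loss be <inline-formula><tex-math>\bar{\ell} := \frac{1}{B} \sum_{i=1}^{B} \ell_i</tex-math></inline-formula>.
Recall that standard training algorithms compute the average
gradient <inline-formula><tex-math>\nabla_\theta \bar{\ell}</tex-math></inline-formula> and update <inline-formula><tex-math>\theta</tex-math></inline-formula> with an optimizer such as SGD
or Adam. Even though various optimizers could be used, we
will refer to this class of (non-private) methods as SGD.</p>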
      <p>
        DP-SGD [
        <xref ref-type="bibr" rid="ref13">13</xref>
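        ] is widely used for DP training of deep
neural networks, wherein the per-example gradients <inline-formula><tex-math>\nabla_\theta \ell_i</tex-math></inline-formula> are
computed, and then re-scaled to have an <inline-formula><tex-math>\ell_2</tex-math></inline-formula>-norm of at most
<inline-formula><tex-math>C</tex-math></inline-formula>, as <inline-formula><tex-math>\tilde{g}_i := \nabla_\theta \ell_i \cdot \min\{1, C / \|\nabla_\theta \ell_i\|_2\}</tex-math></inline-formula>. Gaussian noise
<inline-formula><tex-math>\mathcal{N}(0, \sigma^2 C^2 I)</tex-math></inline-formula> is then added to the average <inline-formula><tex-math>\frac{1}{B} \sum_{i=1}^{B} \tilde{g}_i</tex-math></inline-formula>, and the result is
subsequently passed to the optimizer. As shown by Abadi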
et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
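        ], DP-SGD satisfies <inline-formula><tex-math>(\varepsilon, \delta)</tex-math></inline-formula>-DP, where <inline-formula><tex-math>\varepsilon, \delta</tex-math></inline-formula> depend
on <inline-formula><tex-math>\sigma</tex-math></inline-formula>, the batch size, and the number of training steps; this can be
computed using the privacy accounting described in [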
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
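      <p>The clipping and noising step described above can be sketched as follows. This is an illustrative NumPy version, not the authors' implementation: it assumes the common convention in which Gaussian noise with standard deviation equal to the noise multiplier times the clipping norm is added to the sum of clipped gradients before dividing by the batch size, and the function and argument names are hypothetical.</p>
      <preformat><![CDATA[
```python
import numpy as np


def dp_sgd_noisy_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-example gradient to L2 norm <= clip_norm, add noise, average."""
    grads = np.asarray(per_example_grads, dtype=float)   # shape (B, num_params)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # min(1, C / ||g||); the small floor avoids division by zero for zero gradients
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    # Assumed convention: noise with std sigma * C is added to the clipped sum
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]


rng = np.random.default_rng(0)
toy_grads = [[3.0, 4.0], [0.3, 0.4]]      # toy gradients with norms 5.0 and 0.5
noiseless = dp_sgd_noisy_gradient(toy_grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
noisy = dp_sgd_noisy_gradient(toy_grads, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```
]]></preformat>
      <p>With the noise multiplier set to zero, the first gradient is scaled down to norm 1 and the second is left unchanged, so the output is the plain average of the clipped gradients.</p>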
    </sec>
    <sec id="sec-4">
      <title>4. Algorithms</title>
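      <p>We now describe the family of algorithms we use for DP
training with semi-sensitive features. Consider a model, such
as a deep neural network, parameterized by <inline-formula><tex-math>\theta</tex-math></inline-formula>. We will use
the following high-level architecture:</p>
      <p><disp-formula><tex-math>f_\theta(x^\circ, x^\bullet) := f_{\theta_c}\big(g_{\theta^\circ}(x^\circ), h_{\theta^\bullet}(x^\bullet)\big),</tex-math></disp-formula>
where <inline-formula><tex-math>\theta = (\theta^\circ, \theta^\bullet, \theta_c)</tex-math></inline-formula>, <inline-formula><tex-math>g_{\theta^\circ} : \mathcal{X}^\circ \to \mathbb{R}^{d^\circ}</tex-math></inline-formula> is a
nonsensitive tower (i.e., the part of the model that acts on the
nonsensitive features), <inline-formula><tex-math>h_{\theta^\bullet} : \mathcal{X}^\bullet \to \mathbb{R}^{d^\bullet}</tex-math></inline-formula> is a sensitive tower
(acting on the sensitive features), and <inline-formula><tex-math>f_{\theta_c} : \mathbb{R}^{d^\circ} \times \mathbb{R}^{d^\bullet} \to \mathbb{R}</tex-math></inline-formula>
is a common tower.</p>
      <p>We also consider a truncated model that uses the same
parameters <inline-formula><tex-math>\theta^\circ</tex-math></inline-formula> and <inline-formula><tex-math>\theta_c</tex-math></inline-formula>, but does not depend on <inline-formula><tex-math>\theta^\bullet</tex-math></inline-formula>, by
eliminating the dependence on <inline-formula><tex-math>x^\bullet</tex-math></inline-formula>, defined as follows:
<disp-formula><tex-math>f_{\theta^\circ, \theta_c}(x^\circ) := f_{\theta_c}\big(g_{\theta^\circ}(x^\circ), \mathbf{0}\big),</tex-math></disp-formula>
where <inline-formula><tex-math>\mathbf{0} \in \mathbb{R}^{d^\bullet}</tex-math></inline-formula>. For convenience, we use the following
notation for the losses of each of these models:</p>
      <p><disp-formula><tex-math>L(\theta; x, y) := \ell(f_\theta(x), y), \qquad L(\theta^\circ, \theta_c; x, y) := \ell(f_{\theta^\circ, \theta_c}(x^\circ), y).</tex-math></disp-formula></p>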
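      <p>Given a total privacy budget of <inline-formula><tex-math>(\varepsilon, \delta)</tex-math></inline-formula>, we consider learning
algorithms that execute two phases sequentially, satisfying
<inline-formula><tex-math>(\varepsilon_1, 0)</tex-math></inline-formula>-DP and <inline-formula><tex-math>(\varepsilon_2, \delta)</tex-math></inline-formula>-DP respectively, with <inline-formula><tex-math>\varepsilon_1 + \varepsilon_2 = \varepsilon</tex-math></inline-formula>;
hence, by Proposition 2, the overall algorithm satisfies <inline-formula><tex-math>(\varepsilon, \delta)</tex-math></inline-formula>-DP.
We refer to this algorithm as Hybrid, and these phases are as
follows:</p>
      <p>Label-DP Phase. In this phase, we first apply <inline-formula><tex-math>\mathrm{RR}_{\varepsilon_1}</tex-math></inline-formula> to
generate <inline-formula><tex-math>((x_i^\circ, \hat{y}_i))_{i \in [n]}</tex-math></inline-formula>, i.e., a dataset where the sensitive
<inline-formula><tex-math>x_i^\bullet</tex-math></inline-formula>’s are removed and the labels are randomized. Then, we
train the truncated model <inline-formula><tex-math>f_{\theta^\circ, \theta_c}(\cdot)</tex-math></inline-formula> on this data for one
or more epochs of mini-batch SGD. By Proposition 5 and
Proposition 3, this phase satisfies <inline-formula><tex-math>(\varepsilon_1, 0)</tex-math></inline-formula>-DP.</p>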
      <p>
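        To remove the bias introduced by the noisy labels, we define
<inline-formula><tex-math>p := 1/(1 + e^{-\varepsilon_1})</tex-math></inline-formula> and modify the training loss as follows:
<disp-formula><tex-math>\tilde{L}(\theta^\circ, \theta_c; x^\circ, \hat{y}) = \frac{L(\theta^\circ, \theta_c; x^\circ, 1 - \hat{y}) - p \sum_{y' \in \{0,1\}} L(\theta^\circ, \theta_c; x^\circ, y')}{1 - 2p}.</tex-math></disp-formula>
      </p>
      <p>
        DP-SGD Phase. In this phase, we train the entire model
<inline-formula><tex-math>f_\theta(\cdot)</tex-math></inline-formula>, by warm-starting it from the <inline-formula><tex-math>f_{\theta^\circ, \theta_c}</tex-math></inline-formula> model of the
first phase, then training for one or more epochs of DP-SGD.
We propose two variants: in the first, we freeze the nonsensitive
tower <inline-formula><tex-math>g_{\theta^\circ}</tex-math></inline-formula>, and in the second, we fine-tune it. The noise
parameter <inline-formula><tex-math>\sigma</tex-math></inline-formula> is chosen appropriately so that this phase satisfies
<inline-formula><tex-math>(\varepsilon_2, \delta)</tex-math></inline-formula>-DP; in our work, we do this accounting using Rényi
DP [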
        <xref ref-type="bibr" rid="ref13 ref24">13, 24</xref>
        ], though other accounting techniques could be
used, such as privacy loss distributions (PLD) [
        <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
        ].
      </p>
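      <p>The debiasing of the Label-DP phase loss can be checked numerically. The sketch below assumes the binary cross-entropy loss on a scalar logit and takes the debiasing constant to be the probability that Randomized Response keeps a label; all function names are hypothetical. It verifies that the corrected loss, averaged over the randomness of the label flip, equals the loss on the true label.</p>
      <preformat><![CDATA[
```python
import numpy as np


def bce(logit, label):
    """Binary cross-entropy of a scalar logit against a 0/1 label."""
    return np.logaddexp(0.0, -logit) if label == 1 else np.logaddexp(0.0, logit)


def keep_probability(epsilon_1):
    """Probability that randomized response leaves a label unchanged."""
    return 1.0 / (1.0 + np.exp(-epsilon_1))


def debiased_loss(logit, noisy_label, epsilon_1):
    """Corrected loss evaluated on a randomized label (requires epsilon_1 > 0)."""
    p = keep_probability(epsilon_1)
    total = bce(logit, 0) + bce(logit, 1)
    return (bce(logit, 1 - noisy_label) - p * total) / (1.0 - 2.0 * p)


def expected_debiased_loss(logit, true_label, epsilon_1):
    """Average of the corrected loss over the two outcomes of the label flip."""
    p = keep_probability(epsilon_1)
    kept = debiased_loss(logit, true_label, epsilon_1)
    flipped = debiased_loss(logit, 1 - true_label, epsilon_1)
    return p * kept + (1.0 - p) * flipped
```
]]></preformat>
      <p>For any logit and any positive privacy parameter, the expected corrected loss matches the clean loss, which is what makes training on the randomized labels unbiased in expectation.</p>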
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Results</title>
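      <p>We consider two natural baselines: DP-SGD (where all
features are treated as sensitive) and RR on the truncated model
<inline-formula><tex-math>f_{\theta^\circ, \theta_c}</tex-math></inline-formula> (where the sensitive features are discarded and only
the labels are protected). Note that both can be viewed as
special cases of Hybrid, where we use all the privacy budget
in one of the two phases: DP-SGD corresponds to setting
<inline-formula><tex-math>\varepsilon_1 = 0</tex-math></inline-formula> and <inline-formula><tex-math>\varepsilon_2 = \varepsilon</tex-math></inline-formula>; and RR corresponds to setting <inline-formula><tex-math>\varepsilon_1 = \varepsilon</tex-math></inline-formula>
and <inline-formula><tex-math>\varepsilon_2 = 0</tex-math></inline-formula>.</p>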
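      <p>The Hybrid algorithm allows using a different split
between the two phases. A total privacy budget <inline-formula><tex-math>(\varepsilon, \delta)</tex-math></inline-formula> will be
split into <inline-formula><tex-math>(\varepsilon_1, 0)</tex-math></inline-formula> and <inline-formula><tex-math>(\varepsilon_2, \delta)</tex-math></inline-formula>. Since this budget allocation
may have an impact on model quality, we will vary it in our
experiments as follows:
<disp-formula><tex-math>\varepsilon_1 := r \cdot \varepsilon, \qquad \varepsilon_2 := (1 - r) \cdot \varepsilon,</tex-math></disp-formula>
where <inline-formula><tex-math>r \in \{0, 0.25, 0.5, 0.75, 1\}</tex-math></inline-formula>, and the cases <inline-formula><tex-math>r = 0</tex-math></inline-formula> and
<inline-formula><tex-math>r = 1</tex-math></inline-formula> correspond to DP-SGD and RR, respectively.</p>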
      <p>We train binary classification models with the binary
cross-entropy loss and report it together with the AUC loss (defined
as 1 − AUC). We study the trade-offs between privacy and
utility, as well as between model size and utility.</p>
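      <p>The budget split between the two phases of Hybrid amounts to simple arithmetic, sketched below. The function name and the value chosen for the second privacy parameter are placeholders for illustration; the paper does not state them in this section.</p>
      <preformat><![CDATA[
```python
def split_budget(epsilon, delta, ratio):
    """Split (epsilon, delta) into a Label-DP budget and a DP-SGD budget."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    label_dp_budget = (ratio * epsilon, 0.0)            # randomized-response phase
    dp_sgd_budget = ((1.0 - ratio) * epsilon, delta)    # DP-SGD phase
    return label_dp_budget, dp_sgd_budget


# Sweep the ratios used in the experiments; delta=1e-6 is a placeholder value.
sweep = {r: split_budget(8.0, 1e-6, r) for r in (0.0, 0.25, 0.5, 0.75, 1.0)}
```
]]></preformat>
      <p>A ratio of 0 assigns the whole budget to DP-SGD and a ratio of 1 assigns it to Randomized Response, recovering the two baselines.</p>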
      <sec id="sec-5-1">
        <title>5.1. Models</title>
        <p>We evaluate two model classes for : multilayer
perceptrons (MLP) and factorization machines (FM).</p>
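        <p>Multilayer perceptron. In the MLP models, we
concatenate the outputs of the sensitive and nonsensitive towers
before feeding them into joint fully connected layers:
<disp-formula><tex-math>f_\theta(x^\circ, x^\bullet) := f_{\theta_c}\big(g_{\theta^\circ}(x^\circ) \circ h_{\theta^\bullet}(x^\bullet)\big),</tex-math></disp-formula>
where <inline-formula><tex-math>g_{\theta^\circ} : \mathcal{X}^\circ \to \mathbb{R}^{d^\circ}</tex-math></inline-formula>, <inline-formula><tex-math>h_{\theta^\bullet} : \mathcal{X}^\bullet \to \mathbb{R}^{d^\bullet}</tex-math></inline-formula>, <inline-formula><tex-math>f_{\theta_c} : \mathbb{R}^{d^\circ + d^\bullet} \to \mathbb{R}</tex-math></inline-formula>,
and <inline-formula><tex-math>u \circ v</tex-math></inline-formula> denotes concatenation of the vectors
<inline-formula><tex-math>u</tex-math></inline-formula> and <inline-formula><tex-math>v</tex-math></inline-formula>.</p>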
        <p>
          Factorization Machine A factorization machine
(FM) [
          <xref ref-type="bibr" rid="ref27">27</xref>
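          ] embeds each feature into a <inline-formula><tex-math>d = d^\circ = d^\bullet</tex-math></inline-formula>
dimensional embedding and builds all pairwise dot products
between all features. We briefly summarize how FM can
be cast into our notation of <inline-formula><tex-math>f_\theta</tex-math></inline-formula>. The parameters of the
FM model consist of (i) embeddings for the nonsensitive and
sensitive features, <inline-formula><tex-math>\theta^\circ \in \mathbb{R}^{m^\circ \times d}</tex-math></inline-formula> and <inline-formula><tex-math>\theta^\bullet \in \mathbb{R}^{m^\bullet \times d}</tex-math></inline-formula> (one row per feature value), and
(ii) a bias <inline-formula><tex-math>\theta_c \in \mathbb{R}</tex-math></inline-formula>. The combination function <inline-formula><tex-math>f_{\theta_c}</tex-math></inline-formula> consists
of a sum of three different terms: the global bias <inline-formula><tex-math>\theta_c</tex-math></inline-formula>, linear
effects that are encoded in the first dimension of <inline-formula><tex-math>g</tex-math></inline-formula> and <inline-formula><tex-math>h</tex-math></inline-formula>,
and pairwise interactions of the remaining dimensions:
<disp-formula><tex-math>f_{\theta_c}(g, h) = \theta_c + g_1 + h_1 + \langle g_{2\ldots}, h_{2\ldots} \rangle.</tex-math></disp-formula>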
        </p>
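        <p>Unlike an MLP, FM does not have parameters (besides a
bias) in the combiner function and does not need to learn
how to combine embeddings. The towers are computed by:
<disp-formula><tex-math>g_{\theta^\circ}(x^\circ)_1 = \langle x^\circ, \theta^\circ_{\cdot,1} \rangle + \sum_{j} \sum_{j' &gt; j} x^\circ_j x^\circ_{j'} \langle \theta^\circ_{j,2\ldots}, \theta^\circ_{j',2\ldots} \rangle,</tex-math></disp-formula>
<disp-formula><tex-math>g_{\theta^\circ}(x^\circ)_{2\ldots} = \sum_{j} x^\circ_j \theta^\circ_{j,2\ldots}.</tex-math></disp-formula></p>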
        <p>The sensitive tower is computed analogously.</p>
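      <p>To make the tower decomposition concrete, the sketch below computes one FM tower and the combiner in NumPy and compares the result with a direct evaluation of a factorization machine over the concatenated features. It assumes each example is a numeric feature vector and that column 0 of each embedding table holds the linear weights; the function names, shapes, and random inputs are illustrative only.</p>
      <preformat><![CDATA[
```python
import numpy as np


def fm_tower(x, emb):
    """One FM tower. x: (m,) feature vector; emb: (m, d), column 0 = linear weights."""
    linear = x @ emb[:, 0]
    v = emb[:, 1:] * x[:, None]                 # x_j times embedding_j
    s = v.sum(axis=0)
    # sum over pairs j < j' of <v_j, v_j'> = 0.5 * (||sum_j v_j||^2 - sum_j ||v_j||^2)
    pairwise = 0.5 * (s @ s - np.sum(v * v))
    return np.concatenate(([linear + pairwise], s))


def fm_combine(bias, g, h):
    """Combiner: bias + first coordinates + dot product of remaining coordinates."""
    return bias + g[0] + h[0] + g[1:] @ h[1:]


def fm_reference(bias, x, emb):
    """Direct FM evaluation over all features, used only as a cross-check."""
    out = bias + x @ emb[:, 0]
    for j in range(len(x)):
        for k in range(j + 1, len(x)):
            out += x[j] * x[k] * (emb[j, 1:] @ emb[k, 1:])
    return out


rng = np.random.default_rng(0)
x_ns = rng.integers(0, 2, 5).astype(float)      # toy nonsensitive features
x_s = rng.integers(0, 2, 4).astype(float)       # toy sensitive features
emb_ns, emb_s = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))
two_tower = fm_combine(0.1, fm_tower(x_ns, emb_ns), fm_tower(x_s, emb_s))
reference = fm_reference(0.1, np.concatenate([x_ns, x_s]), np.vstack([emb_ns, emb_s]))
```
]]></preformat>
      <p>The two values agree because the within-tower interactions are absorbed into the first coordinate of each tower, while the dot product of the remaining coordinates supplies exactly the cross-tower interactions.</p>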
        <p>See Appendix A for further details about the experimental
setup.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Criteo Display Ads pCTR Dataset</title>
        <p>
          The first benchmark we consider is a pCTR prediction task
on the Criteo Display Ads Dataset [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], which contains
around 40 million examples. The dataset has a labeled
training set and an unlabeled test set. We only use the labeled
training set and split it chronologically into an 80%/10%/10%
partition of train/validation/test sets. Each example consists
of 13 integer features int-feature-[1-13] and 26
categorical features categorical-feature-[14-39].
Since the precise interpretation of these features is not
available, we arbitrarily consider all even-numbered features as
sensitive and all odd-numbered features as nonsensitive.
        </p>
        <p>For this dataset, the AUC loss of the non-privately trained
baselines is 0.1941 for the MLP model and 0.1930 for the
FM model.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Criteo Sponsored Search Conversion Log Dataset</title>
        <p>
          The second benchmark we consider is a pCVR
prediction task on the Criteo Spons. Search Conversion Log
Dataset [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], which contains 16 million examples. We
used a random 80%/20% partition of train/test sets and the
reported metrics are on the test set. The task considered in
this work is a conversion prediction task, predicting the
binary feature Sale (which has 10.8% positive occurrences).
The sensitive features are device_type, audience_id,
and user_id. We consider all other features to be
nonsensitive, except for features denoted Outcome/Labels in [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ],
and product_price, all of which are omitted from the
model.<sup>3</sup>
        </p>
        <p>For this dataset, the AUC loss of the non-privately trained
baselines is 0.2099 for the MLP model, and 0.2154 for the
FM model.</p>
        <p><sup>3</sup> Although product_price is not explicitly marked as a label, it has
a very high correlation with the label and the prediction task would
become significantly easier if we were to include it.</p>
        <p>[Figure: panel (i) MLP model. Curves: Hybrid (frozen), Hybrid (fine-tuned), DP-SGD, RR. Vertical axis: log loss.]</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.4. Results</title>
        <sec id="sec-5-5-1">
          <title>5.4.1. Improved privacy-utility trade-off</title>
          <p>Privacy-utility trade-offs on Criteo Display Ads and Criteo
Sponsored Search are reported in Figures 1 and 2 respectively.
On both benchmarks, we find that Hybrid improves over RR
and DP-SGD across a range of privacy budgets. Specifically,
we see an improvement in utility for both the MLP and FM
models when <inline-formula><tex-math>\varepsilon \ge 4</tex-math></inline-formula>. In this regime, there are substantial
improvements: for instance, Hybrid achieves a better utility
at <inline-formula><tex-math>\varepsilon = 8</tex-math></inline-formula> than DP-SGD at <inline-formula><tex-math>\varepsilon = 12</tex-math></inline-formula> (this is the case for both
datasets and both models).</p>
          <p>This significantly narrows the gap between the private
model and the non-private baselines on Criteo Display Ads.
For example, at <inline-formula><tex-math>\varepsilon = 12</tex-math></inline-formula>, the relative increase in AUC loss
(defined as <inline-formula><tex-math>\frac{1 - \mathrm{AUC}}{1 - \mathrm{AUC}_{\text{non-private}}} - 1</tex-math></inline-formula>) goes from 6.3% for
MLP-DPSGD to 3.2% for MLP-Hybrid; and it goes from 3.0% for
FM-DPSGD to 1.2% for FM-Hybrid. In both cases, the gap
to the non-private model is approximately halved.</p>
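      <p>As a small worked example of the relative-increase metric, the sketch below inverts it to recover the private AUC loss implied by the reported figures. The function name is hypothetical, and the implied private value is derived arithmetic from the numbers quoted in the text (a non-private MLP AUC loss of 0.1941 and a 6.3% relative increase), not a number reported by the authors.</p>
      <preformat><![CDATA[
```python
def relative_auc_loss_increase(private_auc_loss, non_private_auc_loss):
    """(1 - AUC) / (1 - AUC_non_private) - 1, with AUC loss defined as 1 - AUC."""
    return private_auc_loss / non_private_auc_loss - 1.0


non_private = 0.1941                              # MLP on Criteo Display Ads
implied_private = non_private * (1.0 + 0.063)     # implied by a 6.3% increase
check = relative_auc_loss_increase(implied_private, non_private)
```
]]></preformat>
      <p>Under these assumptions, a 6.3% relative increase corresponds to a private AUC loss of roughly 0.206.</p>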
          <p>However, in the higher privacy regime (for <inline-formula><tex-math>\varepsilon = 1</tex-math></inline-formula>), the
quality of the Hybrid-trained models appears to deteriorate,
and in most cases it no longer improves upon DP-SGD. We
believe this may be because the utility of the RR algorithm
significantly deteriorates for small <inline-formula><tex-math>\varepsilon</tex-math></inline-formula>, and there may no longer
be a benefit to the Label-DP phase in this regime.</p>
          <p>It is also worth observing that the loss of the RR model
plateaus at a value that is much higher than other methods –
recall that the RR model is only trained on the non-sensitive
features, hence its quality is limited by the best model one
can train on these features alone.</p>
        </sec>
        <sec id="sec-5-5-2">
          <title>5.4.2. Freezing vs fine-tuning</title>
          <p>Freezing the nonsensitive tower during the second phase may
offer a computational advantage, as one no longer needs to
compute or clip gradients of this tower.</p>
          <p>To understand the impact this may have on quality, we
compare the two Hybrid variants (frozen and fine-tuned). We
observe that in some settings (specifically on Criteo Display</p>
          <p>[Figure: panel (ii) FM model. Curves: Hybrid (frozen), Hybrid (fine-tuned), DP-SGD, RR. Vertical axis: log loss.]</p>
          <p>[Figure legend: <inline-formula><tex-math>\varepsilon = 1, 2, 4, 8, 12</tex-math></inline-formula>.]</p>
          <p>
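To further understand the effect of the budget split, we report,
in Figure 3, the AUC loss of the Hybrid FM models, as
we vary the budget allocation ratio <inline-formula><tex-math>r = \varepsilon_1 / \varepsilon</tex-math></inline-formula>. First, observe
that for <inline-formula><tex-math>\varepsilon = 1</tex-math></inline-formula>, the best utility is achieved when <inline-formula><tex-math>r = 0</tex-math></inline-formula>
(which corresponds to the special case of DP-SGD); this is
consistent with the results of Section 5.4.1. As <inline-formula><tex-math>\varepsilon</tex-math></inline-formula> increases
(<inline-formula><tex-math>4 \le \varepsilon \le 8</tex-math></inline-formula>), we observe that the optimal <inline-formula><tex-math>r</tex-math></inline-formula> increases, and
the best utility is typically achieved when <inline-formula><tex-math>r \ge 0.5</tex-math></inline-formula>, i.e., one
benefits from spending a significant part of the budget on
the RR phase. Finally, in the high <inline-formula><tex-math>\varepsilon</tex-math></inline-formula> regime (<inline-formula><tex-math>\varepsilon = 12</tex-math></inline-formula>), the
optimal <inline-formula><tex-math>r</tex-math></inline-formula> decreases again, and is equal to <inline-formula><tex-math>r = 0.25</tex-math></inline-formula> in both
datasets. This may be explained by the fact that the utility of
the RR-trained model plateaus when <inline-formula><tex-math>\varepsilon</tex-math></inline-formula> grows (see Figure
1(ii) and Figure 2(ii)), so one may not benefit from spending
a higher budget on the RR phase.</p>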
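          <p>Given the large impact the budget ratio <inline-formula><tex-math>r</tex-math></inline-formula> has on quality,
one should generally treat it as an important parameter to
tune when using the Hybrid method, and it should be tuned
separately for different values of <inline-formula><tex-math>\varepsilon</tex-math></inline-formula>.</p>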
          <p>[Figure: vertical axis: AUC loss (1 − AUC).]</p>
        </sec>
        <sec id="sec-5-5-3">
          <title>5.4.4. Model-size utility trade-off</title>
          <p>In situations in which the DP-SGD phase of training happens
in Trusted Execution Environments, one may be faced with
stringent memory and compute constraints. In such scenarios,
it is important to understand the trade-offs between utility
and the size of the private model.</p>
          <p>We vary the private model size on the Criteo Display Ads
pCTR dataset by varying the vocabulary size of the sensitive
tower (this is done by computing privatized counts of the
sensitive feature values and keeping only features above a
threshold; varying the threshold leads to different model
sizes). We report the results in Figure 4, for <inline-formula><tex-math>\varepsilon = 12</tex-math></inline-formula>. Here
the largest model size corresponds to the results reported in
Figure 1.</p>
          <p>We observe that for a large range of model sizes, the
deterioration in quality is surprisingly low. For example, focusing
on the fine-tuned variant, when decreasing the model size
ten-fold, the log loss of the FM model increases by 0.027%,
and its AUC loss increases by 0.025%. When decreasing
the model size fifty-fold, the log loss increases by 0.059%
and the AUC loss by 0.078%. The loss remains well below
that of the full-sized model at <inline-formula><tex-math>\varepsilon = 8</tex-math></inline-formula> (denoted by the dashed
lines on the figure). This indicates that for these two
benchmarks, one may train significantly smaller models under DP
constraints without sacrificing much quality.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Directions</title>
      <p>In this work, we studied training DP models with
semi-sensitive features, and presented an algorithm that improves
over two natural baselines on real ad modeling datasets.</p>
      <p>
        Our experiments indicate that in the high privacy regime,
it is difficult to improve upon DP-SGD. This invites further
investigation into this regime, either theoretically (by studying
utility bounds), or experimentally. An interesting direction
to explore is the use of label DP primitives beyond RR, e.g.,
[
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ], particularly ones that perform better for smaller <inline-formula><tex-math>\varepsilon</tex-math></inline-formula>.
      </p>
      <p>
        Another open question is the precise characterization of
the differences (both in terms of privacy guarantees and
potential utility gap) between the notion of “DP with
semi-sensitive features” studied in this work, and “DP with public
features” from [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>A. Training Details</title>
      <p>In both Criteo datasets used in this work, the sensitive
features (resp. nonsensitive features) are fed into a single
embedding layer: each value of a sensitive (resp. nonsensitive)
feature is concatenated to its feature name to form a unique
string. These string values are then either hashed with a
fixed number of hash bins, or a vocabulary of all sensitive
(resp. nonsensitive) strings is created, where only frequent values
are kept while the remaining values are mapped to a single
out-of-vocabulary token. In all our experiments reported
here we use an embedding dimension of 32. The model size
is therefore controlled either by the number of hash bins
for the sensitive and nonsensitive features, or by frequency
thresholds defining the sensitive and nonsensitive
vocabularies. In the latter case, we compute the frequency counts
of the nonsensitive feature values exactly, while the counts
of the sensitive features are computed privately, consuming
a portion of the total privacy budget. We report the privacy
parameters assuming that the vocabulary is known and only
the counts are private; without this assumption the privacy
parameter <inline-formula><tex-math>\varepsilon</tex-math></inline-formula> increases by at most 0.007 compared to the reported
numbers.</p>
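      <p>The two ways of mapping feature strings to embedding rows can be sketched as follows. This is an illustrative version only: the string format, the SHA-256 based hash, the "at least the threshold" rule, and all names are assumptions made here, and the counts are computed exactly, as the text describes for nonsensitive features (the privatized counting used for sensitive features is not shown).</p>
      <preformat><![CDATA[
```python
import hashlib
from collections import Counter


def feature_string(name, value):
    """Concatenate a feature name and value into one unique string."""
    return f"{name}={value}"


def hash_bucket(name, value, num_bins):
    """Hashing strategy: map the feature string to one of num_bins embedding rows."""
    digest = hashlib.sha256(feature_string(name, value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bins


def build_vocabulary(examples, threshold):
    """Vocabulary strategy: keep strings seen at least `threshold` times."""
    counts = Counter(feature_string(n, v) for ex in examples for n, v in ex.items())
    kept = sorted(s for s, c in counts.items() if c >= threshold)
    return {s: i + 1 for i, s in enumerate(kept)}    # id 0 is reserved for OOV


def vocab_id(vocab, name, value):
    """Look up a feature string; infrequent or unseen values map to the OOV id 0."""
    return vocab.get(feature_string(name, value), 0)


toy_examples = [{"site": "a"}, {"site": "a"}, {"site": "b"}]   # toy data
vocab = build_vocabulary(toy_examples, threshold=2)
```
]]></preformat>
      <p>Raising the threshold shrinks the vocabulary and therefore the embedding table, which is how the model size is controlled under the vocabulary strategy; under hashing, the number of bins plays the same role.</p>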
      <p>We compare these two strategies (hashing vs vocabulary
thresholding) on the Criteo Display Ads pCTR dataset, and
report the results in Figure 5. We find that the two approaches
yield similar quality at the same model size (with a slight
advantage to the vocab thresholding approach).</p>
      <p>When using thresholding in Criteo Display Ads, we used
a vocab threshold of 16 for non-sensitive features, and a
threshold in {16, 64, 256, 1024, 4096} for sensitive features
(this controls the sensitive tower size). On the Criteo
Sponsored Search Conversion Log dataset, we opt for the simpler
hashing approach, with 50k (resp. 100k) hash bins for the
sensitive (resp. non-sensitive) features.</p>
      <p>All features in both datasets being univalent, each example
is transformed into a fixed number of embeddings: 19
sensitive and 20 nonsensitive embeddings for the Criteo Display
Ads pCTR dataset, and 3 sensitive and 17 nonsensitive<sup>4</sup>
embeddings for the Criteo Sponsored Search Conversion Log
dataset.</p>
      <p>
        Multilayer Perceptron For the Criteo Display Ads
pCTR dataset, the 20 nonsensitive features are concatenated
and fed into a single fully connected layer with 598 hidden
units and using a ReLU activation function. The output of
this layer is concatenated with the 19 embeddings of the
sensitive features, and these are fed into two fully connected
layers, each also containing 598 hidden units and using a
ReLU activation function. The final output is a linear
combination of the last layer which produces a scalar logit
prediction. We use the Yogi optimizer [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] with a base learning
rate of 0.01 and a batch size of 1024 for our baseline and
for the RR phase of training. We use SGD with a base
learning rate of 0.1, momentum 0.9, and batch size of 16384 for
the DP-SGD phase of training. For both, we scale the base
learning rate with a cosine decay [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]. We train with 10
RR epochs and 50 or 100 DP-SGD epochs, and we tune the
norm bound in the range [10, 50].
      </p>
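      <p>The wiring described above can be sketched in plain Python. This is only an illustration of the layer layout: the embedding dimension (emb_dim), the Gaussian initialization, and all function names are assumptions for the example and are not taken from the experiments. Only the layer widths and the point at which the sensitive embeddings are concatenated follow the text.</p>
      <preformat>
```python
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, bias); weights[j] holds the n_in weights of output unit j."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, relu=True):
    """Fully connected layer, optionally followed by a ReLU."""
    weights, bias = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, v) for v in out] if relu else out


def init_mlp(n_nonsens=20, n_sens=19, emb_dim=8, hidden=598, seed=0):
    """Build the four layers; emb_dim and the initialization are assumed values."""
    rng = random.Random(seed)
    return {
        "nonsens": init_layer(n_nonsens * emb_dim, hidden, rng),
        "hidden1": init_layer(hidden + n_sens * emb_dim, hidden, rng),
        "hidden2": init_layer(hidden, hidden, rng),
        "logit": init_layer(hidden, 1, rng),
    }


def mlp_logit(params, nonsens_emb, sens_emb):
    """Scalar logit for one example, given its lists of embedding vectors."""
    flat_nonsens = [v for e in nonsens_emb for v in e]
    flat_sens = [v for e in sens_emb for v in e]
    h = dense(flat_nonsens, params["nonsens"])   # nonsensitive layer
    h = dense(h + flat_sens, params["hidden1"])  # concatenate sensitive embeddings
    h = dense(h, params["hidden2"])
    return dense(h, params["logit"], relu=False)[0]  # linear output
```
      </preformat>
      <p>In this layout the sensitive embeddings enter only after the first layer, so the first layer depends on the nonsensitive features alone.</p>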
      <p>
        For the Criteo Sponsored Search Conversion Log dataset,
the 17 nonsensitive features are concatenated and fed
into a single fully connected layer with 256 hidden units and
using a ReLU activation function. The output of this layer is
concatenated with the 3 embeddings of the sensitive features,
and these are fed into two fully connected layers, each also
containing 256 hidden units and using a ReLU activation
function. The final output is a linear combination of the last
layer which produces a scalar logit prediction. We use the
Adam optimizer with batch size of 1024 for the RR phase of
training, and a batch size of 16384 for the DP-SGD phase
of training. We train with 16 RR epochs and 64 DP-SGD
epochs, and we tune the clipping norm in the range [10, 30].
      </p>
      <p><sup>4</sup> The click_timestamp feature is replaced by two features,
click_hour_of_day and click_day_of_week, and
nbr_clicks_1week is replaced by its log2-transformed value.</p>
      <p>
Factorization Machine. A linear model composed of a
bias term and 20+19 (resp. 17+3) linear coefficients
complements the above-mentioned embeddings for the Criteo
Display Ads pCTR dataset (resp. the Criteo Sponsored Search
Conversion Log dataset). The scalar logit prediction is the
sum of this linear model and all the pairwise dot products of
the 39 (resp. 20) embeddings. We use the Adam optimizer
with a batch size of 16384 for all our experiments. In an
initial hyper-parameter search, we also tuned the standard
deviation of the random initialization of all model parameters,
as well as the regularization of the embeddings, and we
settled on a standard deviation of 10<sup>−2</sup> and a regularization
strength in {10<sup>−2</sup>, 10<sup>−3</sup>, 10<sup>−4</sup>, 10<sup>−5</sup>}
for all experiments. In practice, the regularization parameter
had little impact. The most important parameters to tune
are the learning rate (tuned in the range [10<sup>−5</sup>, 10<sup>−3</sup>]) and
the clipping norm (tuned in the range [10, 30]). Note that
we did not distinguish the dataset used for hyper-parameter
tuning from the one used to report the final metrics, as our
experiments on the Criteo Display Ads pCTR dataset showed
virtually no difference between metrics measured on either
set.
      </p>
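      <p>The factorization machine logit described above can be written out directly. The sketch below is illustrative only: the function names are hypothetical, and it assumes that each example supplies one linear coefficient and one embedding per (univalent) feature. The second function uses the standard identity that the sum of all pairwise dot products equals half of the squared norm of the summed embeddings minus the sum of the squared norms, which avoids enumerating pairs.</p>
      <preformat>
```python
from itertools import combinations


def fm_logit(bias, linear_coeffs, embeddings):
    """Bias + linear terms + all pairwise dot products of the embeddings."""
    pairwise = sum(
        sum(a * b for a, b in zip(u, v))
        for u, v in combinations(embeddings, 2)
    )
    return bias + sum(linear_coeffs) + pairwise


def fm_logit_fast(bias, linear_coeffs, embeddings):
    """Same value, computed as 0.5 * (|sum of v|^2 - sum of |v|^2)."""
    dim = len(embeddings[0])
    total = [sum(e[k] for e in embeddings) for k in range(dim)]
    square_of_sum = sum(t * t for t in total)
    sum_of_squares = sum(x * x for e in embeddings for x in e)
    return bias + sum(linear_coeffs) + 0.5 * (square_of_sum - sum_of_squares)
```
      </preformat>
      <p>With 39 (resp. 20) embeddings per example, the direct form enumerates 741 (resp. 190) pairs, while the second form needs a single pass over the embeddings.</p>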
      <p>Finally, we note that the optimal hyper-parameters tend to
differ when optimizing for AUC loss vs log loss. In particular,
we found that good models in terms of log loss tend to require
much larger clipping norms than models optimizing AUC.</p>
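      <p>For reference, the two metrics can be computed as in the sketch below. The function names are hypothetical and the snippet is not the evaluation code used in the experiments. One property worth keeping in mind, although the text does not claim it as the cause of the observation above, is that AUC depends only on the ranking of the scores, whereas log loss also depends on their scale.</p>
      <preformat>
```python
import math


def log_loss(labels, logits):
    """Mean binary cross-entropy computed from logits in a numerically stable way."""
    total = 0.0
    for y, z in zip(labels, logits):
        # log(1 + exp(z)) - y * z, rewritten to avoid overflow for large |z|
        total += max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))
    return total / len(labels)


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```
      </preformat>
      <p>Multiplying all logits by a positive constant leaves auc unchanged but changes log_loss, so the two objectives need not favour the same hyper-parameters.</p>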
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wilander</surname>
          </string-name>
          ,
          <source>Full Third-Party Cookie Blocking and More</source>
          ,
          <year>2020</year>
          . https://webkit.org/blog/10218/full-third-party-cookie-blocking-and-more/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wood</surname>
          </string-name>
          ,
          <source>Today's Firefox Blocks Third-Party Tracking Cookies and Cryptomining by Default</source>
          ,
          <year>2019</year>
          . https://blog.mozilla.org/en/products/firefox/todays-firefox-blocks-third-party-tracking-cookies-and-cryptomining-by-default/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schuh</surname>
          </string-name>
          ,
          <article-title>Building a more private web: A path towards making third party cookies obsolete</article-title>
          ,
          <year>2020</year>
          . https://blog.chromium.org/2020/01/building-more-private-web-path-towards.html.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Thomson</surname>
          </string-name>
          , Privacy Preserving Attribution for Advertising,
          <year>2022</year>
          . https://blog.mozilla.org/en/mozilla/privacy-preserving-attribution-for-advertising/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Pfeiffer</surname>
            <suffix>III</suffix>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Charles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. H.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Parsana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <article-title>Masked lark: Masked learning, aggregation and reporting workflow</article-title>
          ,
          <source>arXiv:2110.14794</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Microsoft</surname>
          </string-name>
          , MaskedLARk,
          <year>2021</year>
          . https://github.com/microsoft/maskedlark.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nalpas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>White</surname>
          </string-name>
          , Attribution Reporting,
          <year>2021</year>
          . https://developer.chrome.com/en/docs/privacy-sandbox/attribution-reporting/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Android</surname>
          </string-name>
          , Attribution reporting,
          <year>2023</year>
          . https://developer.android.com/design-for-safety/privacy-sandbox/attribution.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wilander</surname>
          </string-name>
          , Introducing Private Click Measurement, PCM,
          <year>2021</year>
          . https://webkit.org/blog/11529/introducing-private-click-measurement-pcm/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Winstrom</surname>
          </string-name>
          ,
          <article-title>A proposal for privacy preserving ad attribution measurement using prio-like architecture</article-title>
          ,
          <year>2023</year>
          . https://github.com/patcg/proposals/issues/17.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kenthapadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>McSherry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mironov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naor</surname>
          </string-name>
          ,
          <article-title>Our data, ourselves: Privacy via distributed noise generation</article-title>
          ,
          <source>in: EUROCRYPT</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>486</fpage>
          -
          <lpage>503</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>McSherry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Calibrating noise to sensitivity in private data analysis</article-title>
          ,
          <source>in: TCC</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>265</fpage>
          -
          <lpage>284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chu</surname>
          </string-name>
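          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. B.</given-names>
            <surname>McMahan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mironov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Talwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,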
          <article-title>Deep learning with differential privacy</article-title>
          ,
          <source>in: CCS</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>308</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <article-title>Sample complexity bounds for differentially private learning</article-title>
          ,
          <source>in: COLT</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>155</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Golowich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Manurangsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Deep learning with label differential privacy</article-title>
          ,
          <source>in: NeurIPS</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>27131</fpage>
          -
          <lpage>27145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Malek Esmaeili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mironov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Prasad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Shilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tramer</surname>
          </string-name>
          ,
          <article-title>Antipodes of label differential privacy: PATE and ALIBI</article-title>
          , in: NeurIPS, volume
          <volume>34</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>6934</fpage>
          -
          <lpage>6945</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Protected Audience API: On-device ad auctions to serve remarketing and custom audiences, without cross-site third-party tracking</article-title>
          ,
          <year>2022</year>
          . https://developer.chrome.com/docs/privacy-sandbox/protected-audience.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Android</surname>
          </string-name>
          ,
          <article-title>Protected Audience API on Android developer guide</article-title>
          ,
          <year>2023</year>
          . https://developer.android.com/design-for-safety/privacy-sandbox/guides/protected-audience.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W.</given-names>
            <surname>Krichene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. E.</given-names>
            <surname>Mayoraz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thakurta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Private learning with public features</article-title>
          ,
          <source>in: AISTATS</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>4150</fpage>
          -
          <lpage>4158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krishnaswamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Munagala</surname>
          </string-name>
          ,
          <article-title>Classification with partially private features</article-title>
          ,
          <source>arXiv:2312.07583</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>W.</given-names>
            <surname>Krichene</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mayoraz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thakurta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Private learning with public features</article-title>
          ,
          <source>arXiv:2310.15454</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>The algorithmic foundations of differential privacy</article-title>
          ,
          <source>Foundations and Trends® in Theoretical Computer Science</source>
          <volume>9</volume>
          (
          <year>2014</year>
          )
          <fpage>211</fpage>
          -
          <lpage>407</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Warner</surname>
          </string-name>
          ,
          <article-title>Randomized response: a survey technique for eliminating evasive answer bias</article-title>
          ,
          <source>JASA</source>
          <volume>60</volume>
          <issue>309</issue>
          (
          <year>1965</year>
          )
          <fpage>63</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>I.</given-names>
            <surname>Mironov</surname>
          </string-name>
          ,
          <article-title>Rényi differential privacy</article-title>
          ,
          <source>in: CSF</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>263</fpage>
          -
          <lpage>275</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Meiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mohammadi</surname>
          </string-name>
          ,
          <article-title>Tight on budget? Tight bounds for r-fold approximate differential privacy</article-title>
          ,
          <source>in: CCS</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>247</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Sommer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Meiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mohammadi</surname>
          </string-name>
          ,
          <article-title>Privacy loss classes: The central limit theorem in differential privacy</article-title>
          ,
          <source>Proc. Priv. Enhancing Technol.</source>
          <volume>2019</volume>
          (
          <year>2019</year>
          )
          <fpage>245</fpage>
          -
          <lpage>269</lpage>
          . URL: https://doi.org/10.2478/popets-2019-0029. doi:10.2478/POPETS-2019-0029.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rendle</surname>
          </string-name>
          ,
          <article-title>Factorization machines</article-title>
          , in: ICDM,
          <year>2010</year>
          , pp.
          <fpage>995</fpage>
          -
          <lpage>1000</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>O. C.</given-names>
            <surname>Jean-Baptiste</surname>
          </string-name>
          <string-name>
            <surname>Tien</surname>
          </string-name>
          , joycenv, Display advertising challenge,
          <year>2014</year>
          . URL: https://kaggle.com/competitions/criteo-display-ad-challenge.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <article-title>Reacting to variations in product demand: An application for conversion rate (cr) prediction in sponsored search</article-title>
          ,
          <source>arXiv:1806.08211</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaheer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sachan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Adaptive methods for nonconvex optimization</article-title>
          ,
          <source>in: NIPS</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>SGDR: stochastic gradient descent with warm restarts</article-title>
          ,
          <source>in: ICLR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>