<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Field-wise Embedding Size Search via Structural Hard Auxiliary Mask Pruning for Click-Through Rate Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tesi Xiao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xia Xiao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ming Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Youlong Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ByteDance, Mountain View</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California</institution>
          ,
          <addr-line>Davis</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Feature embeddings are one of the most important steps when training deep learning based Click-Through Rate prediction models, which map high-dimensional sparse features to dense embedding vectors. Classic human-crafted embedding size selection methods are shown to be “sub-optimal" in terms of the trade-of between memory usage and model capacity. The trending methods in Neural Architecture Search (NAS) have demonstrated their eficiency to search for embedding sizes. However, most existing NAS-based approaches sufer from expensive computational costs, the curse of dimensionality of the search space, and the discrepancy between continuous search space and discrete candidate space. Other methods that prune embeddings in an unstructured manner fail to explicitly reduce the computational costs. In this paper, to address those limitations, we propose a novel strategy that searches for the optimal mixed-dimension embedding scheme by structurally pruning a super-net via Hard Auxiliary Mask. Our method aims to directly search candidate models in the discrete space using a simple and eficient gradient-based method. Furthermore, we introduce orthogonal regularity on embedding tables to reduce correlations within embedding columns and enhance representation capacity. Extensive experiments demonstrate it can efectively remove redundant embedding dimensions without great performance loss.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;CTR Prediction</kwd>
        <kwd>Embedding Size</kwd>
        <kwd>Network Pruning</kwd>
        <kwd>Neural Architecture Search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        embedding dimension for all features, either due to the
prerequisites of the model input or simply for the sake
Deep learning based recommender systems (DLRS) have of convenience. If the embedding dimensions are
unidemonstrated superior performance over more tradi- formly high, it leads to increased memory usage and
tional recommendation techniques [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The success of computational cost, as it fails to handle the heterogeneity
DLRS is mainly attributed to their ability to learn mean- among diferent features. As a concrete example,
encodingful representations with categorical features, that sub- ing features with few unique values like gender with large
sequently help with modeling the non-linear user-item embedding vectors leads to over-parametrization.
Conrelationships eficiently. Indeed, real-world recommenda- versely, the selected embedding size may be insuficient
tion tasks usually involve a large number of categorical for highly-predictive features with large cardinalities,
feature fields with high cardinality (i.e. the number of such as the user’s last search query. Therefore, finding
unique values or vocabulary size) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. One-Hot Encoding appropriate embedding dimensions for diferent feature
is a standard way to represent such categorical features. fields is essential.
      </p>
      <p>
        To reduce the memory cost of One-Hot Encoding, DLRS The existing works towards automating embedding
ifrst maps the high-dimensional one-hot vectors into size search can be categorized into two groups: (i)
fieldreal-valued dense vectors via the embedding layer. Such wise search [
        <xref ref-type="bibr" rid="ref2 ref5 ref6">5, 2, 6</xref>
        ]; (ii) vocabulary-wise search [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8, 9, 10</xref>
        ].
embedded vectors are subsequently used in predictive The former group aims to assign diferent embedding
models for obtaining the required recommendations. sizes to diferent feature fields, while embeddings in the
      </p>
      <p>
        In this pipeline, the choice of the dimension of the same feature field share the same dimension. The
latembedding vectors, also known as embedding dimension, ter further attempts to find diferent embedding sizes
plays a crucial role in the overall performance of the to diferent feature values within the same feature field,
DLRS. Most existing models assign a fixed and uniform which is generally based on the frequencies of feature
DL4SR’22: Workshop on Deep Learning for Search and Recommen- values. Although it has been shown that the latter group
dation, co-located with the 31st ACM International Conference on can significantly reduce the model size without a great
Information and Knowledge Management (CIKM), October 17-21, 2022, performance drop, the second group of works sufers
Atlanta, USA from several challenges and drawbacks (we refer readers
* Corresponding author. to Section 3.4 in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for details): (a) a large number of
($X. tXexiaiaoo);@muicndga.cvhise.ned@ub(yTt.eXdaianoc)e;.cxo.xmiax(Mia.oC@hbeynt)e;dance.com unique values in each feature field leads to a huge search
youlong.chen@bytedance.com (Y. Cheng) space in which the optimal solution is dificult to find;
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License (b) the feature values frequencies are time-varying and
CPWrEooUrckReshdoinpgs IhStpN:/c1e6u1r3-w-0s.o7r3g ACttEribUutRion W4.0oInrtekrnsahtioonpal (PCCroBYce4.0e).dings (CEUR-WS.org)
Hard Auxiliary Mask
Embedding Vector
Embedding Table
      </p>
      <p>!
Interaction Layers
1 0 1 1</p>
      <p>⊙
-0.6 0.3 -0.2 0.12
Embedding</p>
      <p>
        Look-up
not pre-known in real-time recommender system; (c) it is into the same dimension with batch normalization so
dificult to handle embeddings with diferent dimensions that they can be aggregated with architecture weights
for the same feature field in the state of art DLRS due to of candidate sizes and fed into the subsequent layers to
the feature crossing mechanism. As a result, we fix our obtain the prediction. Cheng et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] also leverage the
attention to the field-wise embedding size search in this idea of DARTS but with the architecture weights being
work. the weights of sub-blocks of embedding matrices.
      </p>
      <p>
        The majority of recent works towards automating em- Despite the obtained promising results, the existing
bedding size search are based on the trending ideas in methods have certain limitations which arise from the
Neural Architecture Search (NAS). NAS has been an ac- discrepancy between the large discrete candidate space
tive research area recently, which is composed of three and its continuous relaxation. On the one hand, the
elements: (i) a search space  of all candidate models; DARTS-based methods in [
        <xref ref-type="bibr" rid="ref1 ref2 ref5">5, 2, 1</xref>
        ] are algorithmically
(ii) a search strategy that goes through models in ; (iii) eficient but search in an augmented continuous space;
a performance estimation strategy that evaluates the per- it has been shown that the discrepancy may lead to
formance of the selected model. To leverage the NAS ifnding unsatisfactory candidates because the magnitude
techniques to search embedding sizes for each feature of architecture weights does not necessarily indicate
ifeld, Joglekar et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] formulate the search space by dis- how much the operation contributes to the overall
cretizing an embedding matrix into several sub-matrices performance [12]. On the other hand, while the RL-based
and take a Reinforcement Learning (RL) approach with methods in [
        <xref ref-type="bibr" rid="ref6 ref7">7, 6</xref>
        ] and the hard selection variant in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
the Trainer-Controller framework for search and per- directly search and evaluate models in the original
formance estimation, in which the controller samples discrete candidate space, they are computationally heavy
diferent embedding blocks to maximize the reward ob- and thus not favored by the large-scale DLRS. Therefore,
served from the trainer that trains the model with the a natural question follows:
controller’s choices. Liu et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] also adopt the RL-based
search which starts with small embedding sizes and takes Is it possible to come up with an eficient method that
actions to keep or enlarge the size based on the predic- searches straight through the discrete candidate space of
tion error. Their proposed method requires sequentially all possible embedding sizes?
training DRLS and the policy network repeatedly until
convergence, which is computationally expensive. In this work, we provide a positive answer by
propos
      </p>
      <p>
        Given the rise of one-shot NAS aiming to search ing an end-to-end field-aware embedding size search
and evaluate candidate models together in an over- method leveraging the Straight Through Estimator (STE)
parametrized supernet, Zhao et al. [
        <xref ref-type="bibr" rid="ref2 ref5">5, 2</xref>
        ] follow the idea of gradients for non-diferentiable functions. Motivated
of Diferential Neural Architecture Search (DARTS)[ 11] by one-shot NAS and network pruning, our method seeks
and solve a continuous bilevel optimization by the to adaptively search for a good sub-network in a
pregradient-based algorithm. To be specific, they define trained supernet by masking redundant dimensions. The
a search space that includes all possible embedding sizes mask, named as Hard Auxiliary Mask (HAM), is
realof each feature field; for instance in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], 5 candidate ized by the indicator function of auxiliary weights.
Fursizes {2, 8, 16, 24, 32} are selected for each feature field. thermore, to reduce the information redundancy and
These embedding vectors in diferent sizes are then lifted boost the search performance, we impose the orthogonal
regularity on embedding tables and train models with
regularized loss functions. Contributions and
novelty of our work are: (i) we propose to prune
embedding tables column-wisely via hard auxiliary masks for
the CTR prediction models, which can efectively
compress the model; (ii) we ofer a gradient-based pruning
method and delve into its dynamics with mathematical
insights; (iii) we introduce orthogonal regularity for
training recommendation models and experimentally show
that orthogonal regularity can help to achieve significant
improvement; (iv) we obtain state-of-the-art results on
various modern models and our method is scalable on
large datasets.
2. Related Work
puts, existing works attempt to cast it into a diferentiable
problem and provide gradient-based search algorithms. A
straightforward method is to directly replace binary
values by smooth functions of auxiliary parameters during
the forward pass, such as sigmoid [16], Gumbel-sigmoid
[17], softmax [11], Gumbel-softmax [18], and piece-wise
linear [19]. However, all these methods, which we name
Soft Auxiliary Mask (SAM), sufer from the discrepancy
caused by continuous relaxation. Other methods, which
we refer to Hard Auxiliary Mask (HAM), preserve binary
values in the forward pass with indicator functions [20]
or Bernoulli random variables [21] and optimize
parameters using the straight-through estimator (STE) [22, 23].
      </p>
      <p>A detailed comparison of four representative auxiliary
masking methods for approximating binary values is
provided in Table 1.</p>
      <sec id="sec-1-1">
        <title>Embedding Size Search for Recommender Systems.</title>
        <p>
          To deal with the heterogeneity among diferent features 3. Methodology
and reduce the memory cost of sizable embedding tables,
several recent works introduce a new paradigm of mixed- 3.1. Preliminaries
dimension embedding tables. In particular, Ginart et al.
[10] propose to substitute the uniform-dimension embed- Here we briefly introduce the mechanism of DLRS and
dings V ∈ R×  with V = EP, where E ∈ R×  is define the terms used throughout the paper.
the smaller embeddings with  ≪  and P ∈ R×  is the Model Architecture. Consider the data input
projection that lift the embeddings into the base dimen- involves  features fields from users, items, their
sion  for feature crossing purposes. A popularity-based interactions, and contextual information. We
derule is further designed to find the dimension  for each note these raw features by multiple one-hot vectors
feature field which is too rough and cannot handle impor- x1 ∈ R1 , . . . , x ∈ R , where field dimensions
tant features with low frequency. In addition, plenty of 1, . . .  are the cardinalities of feature fields 1.
Arworks seek to provide an end-to-end framework leverag- chitectures of DLRS often consist of three key
coming the advances in NAS, including RL approaches [
          <xref ref-type="bibr" rid="ref6 ref7">7, 6</xref>
          ] ponents: (i) embedding layers with tables V1 ∈
and diferentiable NAS methods [
          <xref ref-type="bibr" rid="ref1 ref2 ref5">5, 13, 2, 1</xref>
          ]. The most R1× 1 , . . . , V ∈ R ×  that map sparse one-hot
relevant work to us is [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which employs a soft auxiliary vectors to dense vectors in a low dimensional embedding
mask to prune the embeddings structurally. However, it space by v = V⊤x; (ii) feature interaction layers that
requires magnitude-based pruning to derive fine-grained model complex feature crossing using embedding
vecmixed embedding, in which the discrepancy between the tors; (iii) output layers that make final predictions for
continuous relaxed search space and discrete candidate specific recommendation tasks.
space could lead to a relatively great loss in performance. The feature crossing techniques of existing models
The most recent work [14] proposes a Plug-in Embed- fall into two types - vector-wise and bit-wise. Models
ding Pruning (PEP) approach to prune embedding tables with vector-wise crossing explicitly introduce
interacinto sparse matrices to reduce storage costs, Albeit sig- tions by the inner product, such as Factorization Machine
nificantly reducing the number of non-zero embedding (FM) [26], DeepFM [27] and AutoInt [28]. The bit-wise
parameters, this type of unstructured pruning method crossing, in contrast, implicitly adds interaction terms
fails to explicitly reduce the embedding size. by element-wise operations, such as the outer product
        </p>
        <p>Neural Network Pruning via Auxiliary Masks. To in Deep Cross Network (DCN) [29], and the Hadamard
deploy deep learning models on resource-limited devices, product in NFM [30] and DCN-V2 [31]. In this work, we
there is no lack of work in neural network pruning; see deploy our framework to the Click-Through Rate (CTR)
[15] and the reference therein. One popular approach prediction problem with four base models: FM, DeepFM,
towards solving this problem is to learn a pruning mask AutoInt, and DCN-V2.
through auxiliary parameters, which considers pruning Notations. Throughout this paper, we use ⊙ for
as an optimization problem that tends to minimize the the Hadamard (element-wise) product of two vectors.
supervised loss of the masked neural network with a The indicator function of a scalar-valued  is defined
certain sparsity constraint. As learning the optimal mask as 1&gt; 0 = 1 if  &gt; 0; 1&gt; 0 = 0 if  ≤ 0. The
is indeed a discrete optimization problem for binary
in</p>
        <sec id="sec-1-1-1">
          <title>1Numeric features are converted into categorical data by binning.</title>
          <p>indicator function of a vector  ∈ R is defined as
1 &gt;0 = [1 1&gt;0, . . . , 1 &gt;0]⊤. The identity matrix is
written as I and the function diag(· ) returns a diagonal
matrix with its diagonal entries as the input. We use
‖ · ‖ 1, ‖ · ‖  for the ℓ1 norm of vectors and the Frobenius
norm of matrices respectively.
probability score ˜ is given as follows:
˜ =  (v ⊙ 1 &gt;0|Θ) = (x| V˜ , Θ) = (x| , V, Θ),
where V˜ = {V˜ 1 1 , . . . , V˜  } is the pruned
embedding layer with
˜   = V diag(1 &gt;0),  = 1, . . . .</p>
          <p>V</p>
          <p>are pruned
We emphasize that embedding tables V˜ 
3.2. Background column-wisely in a structural manner unlike PEP [14]
In general, the CTR prediction models takes the concate- that prunes entry-wisely.
nation of all feature fields from a user-item pair, denoted Orthogonal Regularity Given the embedding table
V ∈ R ×  for feature field , its column vectors,
deby x = [x1; x2; . . . ; x ] as the input vector. Given the
embedding layer V = {V1, V2, . . . , V }, the feature noted by V,1, . . . , V, ∈ R , can be regarded as 
interaction layers take the embedding vectors v and feed diferent representations of feature  in the embedding
the learned hidden states into the output layer to obtain space. The auxiliary masks above aim to mask relatively
the prediction. The embedding vectors v are the dense uninformative column vectors so as to reduce the model
encoding of the one-hot input x that can be defined as size. Nonetheless, the presence of correlation between
follows: these vectors may complicate the selection procedure.</p>
          <p>
            Specifically, presuming that the most predictive one V,
v = [v1; v2; . . . ; v ] has been selected, it would be problematic if we greedily
= [︁V1⊤x1; V2⊤x2; . . . ; V⊤ x ]︁ := x, sderloepctwthheenneixntclcuodluedmninV.F,ortihnasttabnricneg,sifthVel,ar g̸⊥estVl o,ss,
where  is the embedding lookup operator. The predic- we have the following decomposition: V, = p + p⊥,
tion score ˆ is then calculated with models’ other param- where p = V, ‖ V, and p⊥ ⊥ p. Therefore, it
eters Θ in the feature interaction layers and output layer would be dificult to determine whether the increments
by ˆ =  (v|Θ) =  (x|Θ) = (x|V, Θ), where are attributed to the existing direction p or the new
facˆ ∈ [
            <xref ref-type="bibr" rid="ref1">0, 1</xref>
            ] is the predicted probability,  (·| Θ) is the pre- tor p⊥. To address this issue, we follow [32] to train
diction function of embedding vectors, and  =  ∘  is embedding parameters V with Soft Orthogonal (SO)
regthe prediction function of raw inputs. To learn the model ularizations:
tphaeratmraeint einrsg Vda,tΘa,,i.we.e, aim to minimize the Log-loss on ℛ(V) = ∑︁ ‖V⊤V − I‖2 /2 , (1)
mV,iΘn ℒtrain(V, Θ)
          </p>
          <p>1 ∑︁ [︀  log(ˆ ) + (1 −  ) log(1 − ˆ )︀]
:= −</p>
          <p>=1
where  is the total number of training samples.</p>
          <p>Hard Auxiliary Mask As illustrated in Figure 1, we
add auxiliary masks for each embedding dimension slot.</p>
          <p>Specifically, the auxiliary masks are indicator functions
of auxiliary parameters  = [ 1;  2; . . . ;   ], where
  ∈ R is in the same size of the corresponding
embedding vector v. Provided with the masks, the predicted
=1
where divisors 2 are introduced to handle
heterogeneous dimensionality of embedding tables. We also adopt
a relaxed SO regularization in which V replaced by the
normalized matrix V with unit column vectors, which
corresponds to the pair-wise cosine similarities within
embedding columns [33].
3.3. Framework</p>
        </sec>
        <sec id="sec-1-1-2">
          <title>Our proposed framework is motivated by one-shot NAS,</title>
          <p>which consists of three stages: pretrain, search, and
retrain.</p>
        </sec>
        <sec id="sec-1-1-3">
          <title>Pretrain. As shown in [34], pre-training architecture</title>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>Early Stopper for the Search Stage. We can deter</title>
        <p>representations improve the downstream architecture
mine to stop the search stage if the number of positive
search eficiency. Therefore, in the pretraining stage, we
auxiliary weights is close to the target size with no
signiftrain the base model with a large embedding layer V. icant gain in validation AUC since the sign of auxiliary
The base dimension  for each feature field is
deterweights exactly indicates whether the corresponding
emmined by prior knowledge. In addition, the embedding
bedding components are pruned or not. On the contrary,
dimension  should not exceed the field dimension</p>
        <p>to avoid column-rank-deficiency. The SO regularization
it is hard to deploy the early stopper for the search stage
using other auxiliary mask pruning methods because the
term (1) is added to the mini-batch training loss for the
value of validation AUC during the search epochs is
unoptimizer to learn near-orthogonal embeddings. The
able to represent the performance of the model selected
learned model parameters, denoted by Vinit, Θinit, are
in the retraining stage.
passed to the search stage as initialization.</p>
        <p>Retrain. After training { , V, Θ} for several epochs</p>
        <p>Search. Provided with the pre-trained model, the goal in the search step, we obtain a binary mask based on
of the search stage is to find the column-wise sparse
the sign of 
embedding layer that preserves model accuracy, which
vergence. The early stopper is deployed for both
preand continue optimizing {V, Θ} till
concan be formulated as:

min min</p>
        <p>V,Θ
ℒtrain
︁( V˜ , Θ
︁)
+  ⃒⃒ ‖1 &gt;0‖1 − ⃒⃒ ,
training and retraining steps, which terminates training
if model performance on validation data cannot be
improved within a few epochs. The overall framework is
where ‖1 &gt;0‖1 counts the number of non-zero embed- described in Algorithm 1.
ding columns and  is the target number of non-zero
columns. Note that instead of direct regularization on
‖1 &gt;0‖1 as [16, 20, 25] do, we include the target
number  to reduce instability from batched training and the
choice of hyperparameter  . However, given that the
objective function above is non-diferentiable when</p>
        <p>
          = 0
and has a zero gradient anywhere else, traditional
gradient descent methods are not applicable. To that end,
we use the straight-through estimator (STE) [22], which
replaces the ill-defined gradient in the chain rule with a
fake gradient. Despite various smooth alternatives used
in the literature, such as sigmoid [16] and piecewise
polynomials [
          <xref ref-type="bibr" rid="ref9">35</xref>
          ], we adopt the simplest identity function for
backpropagation, i.e.,

ℒ =
        </p>
        <p>ℒ
1 &gt;0
1 &gt;0

≈
ℒ</p>
        <p>1 &gt;0 
=</p>
        <p>ℒ
1 &gt;0</p>
        <p>. (2)
Then, the search stage starts with the unmasked model
that  0 =  · ⃗1 for some small  &gt;
update rules for  at iteration  are given by:</p>
        <p>0. The gradient
 +1 =  −  ·∇ (1 &gt;0)ℒbatch−  · sign(‖1 &gt;0‖1− )⃗1,
where  is the learning rate. We will illustrate in
Section 3.4 that the above updates enjoy certain benefits
and the last term plays an important role by pushing
the optimizer iteratively to evaluate candidate models
with a hard auxiliary mask. Furthermore, to enhance the
stability and performance, we implement a multi-step
training through iteratively training auxiliary parameters
on validation data and retraining the original weights,
which attempts to solve the following bi-level
optimization problem:
min

s.t.</p>
        <p>ℒval
V˜⋆ , Θ
︁( ˜ ⋆</p>
        <p>V , Θ
⋆
︁)</p>
        <p>+  ⃒⃒ ‖1 &gt;0‖1 − ⃒⃒
⋆ = arg min
3.4. On the Gradient Dynamics
As finding the optimal mask is essentially a combinatorial
optimization problem over the set of 2 possible status
of on-of switches (  is the number of switches), the
proposed gradient-based search rule given in (3) provides
an alternative approach to look through the discrete
candidate sets in a continuous space. We elaborate below
that the STE-based gradient in (2) is related to the
Taylor approximation for the Leave-One-Out error and the
penalty term drifts auxiliary variables to find a model of
end
end</p>
        <sec id="sec-1-2-1">
          <title>Algorithm 1: Structural HAM Pruning</title>
          <p>base dimensions  (≤
Input: data train, val, test,
target embedding size , hyperparameters</p>
          <p>) for each field ,
,  &gt;</p>
          <p>0;
◁ Pretrain:
while stopping criteria not met do
obtain a batch of samples from train and
update V, Θ regularized with SO (1) by
certain optimizer;
◁ Search: initialize</p>
          <p>=  ⃗1;
while stopping criteria not met do
update</p>
          <p>by Eq. (3);
obtain a batch of samples from val and
obtain a batch of samples from train and
update V, Θ by certain optimizer;
(3)
◁Retrain: mask the dimensions with 0 where</p>
          <p>&lt; 0 and retrain the model until the stopping
criteria are met.
the target size.</p>
          <p>For illustration purpose, we start with the dynamics
of the scalar-valued parameter  , , the -th element of
 , that controls the mask for the -th column vector,
V, , of the embedding table V. Define the function
ℓ() := ℒ[· V, ] as the value of loss function when
replacing V, by  · V, (0 ≤  ≤ 1). By the first-order
Taylor approximation, we have ℓ(1) − ℓ(0) ≈ ℓ′(0) and
ℓ(0) − ℓ(1) ≈ − ℓ′(1). Moreover, it is worth noting
that the STE-based gradient in (2) with regards to  , is
exactly ℓ′(1 , &gt;0), i.e.,
if # of positive )s &lt; 
external =  &gt; 0
masked 0 - V!$,#$
α!$,#$
0
velocity of α!,# ≈  - (ℒ[%&amp;'!,#] −ℒ['!,#])
α!,#</p>
          <p>
            un-masked V!,#
if # of positive )s &gt; 
external = − &lt; 0
1 , &gt;0 (4) the literature in terms of prediction ability, and memory
consumption? (iii) How does the soft orthogonality
regu= ℒ[V, ] − ℒ [0· V, ]. larity boost the performance of our proposed framework?
In other words, the proposed gradient calculates the im- In the following, we will first introduce the experimental
portance (with batches of data) of the embedding com- setups including datasets, baselines, and implementation
ponent V, using the first-order Taylor approximation. details, and then present results as well as discussions.
The above heuristics are also mentioned in [
            <xref ref-type="bibr" rid="ref10 ref11">36, 37, 25</xref>
            ].
          </p>
          <p>We now consider the dynamics of all the auxiliary 4.1. Datasets and Data Preparation.
parameters  , ’s provided by (3). As illustrated above,
the term ∇(1 &gt;0)ℒdata measures the importance of each
embedding component by approximating the diference
between the loss value with un-masked embeddings and
the loss value with a mask on. Furthermore, the sign
of the penalty term  in (3) is determined by the
comparison between the number of un-masked components
and the target size . As an analogue, the dynamics of
auxiliary parameters can be viewed as a particle system
in a neighborhood of 0 on real line, in which the particles
are initialized at the same point  &gt; 0 (i.e., the model
starts with all embeddings un-masked) and the velocity</p>
        </sec>
        <sec id="sec-1-2-2">
          <title>We use three benchmark datasets in our experiments: (i)</title>
          <p>MovieLens-1M: This dataset contains users’ ratings
(15) on movies. We treat samples with ratings greater than
3 as positive samples and samples with ratings below 3
as negative samples. Neutral samples with a rating equal
to 3 are dropped; (ii) Criteo: This dataset has 45
million users’ clicking records on displayed ads. It contains
26 categorical feature fields and 13 numerical feature
ifelds; (iii) Avazu: This dataset has 40 million clicking
records on displayed mobile ads. It has 22 feature fields
spanning from user/device features to ad attributes. We
follow the common approach to remove the infrequent
features and discretize numerical values. First, we
remove the infrequent feature values and treat them as
a single value “unknown", where the threshold is set to
{10, 4} for Criteo, and Avazu respectively. Second, we
convert all numerical values in into categorical values by
transforming a value  to int(log()2) if int() &gt; 2 and
to int() − 2 otherwise, which is proposed by the winner
of Criteo Competition 2. Third, we randomly split all
training samples into 80% for training, 10% for validation,
and 10% for testing.
of each particle is roughly  · (ℒ[0· V, ] − ℒ [V, ]) by
the approximation in (4). In addition, an external force is
introduced by the penalty term,  ⃒⃒ ‖1 &gt;0‖1 − ⃒⃒ , which
can be interpreted as the wind flow with its velocity
being −  &lt; 0 when the number of positive particles is
larger than  and being  &gt; 0 otherwise. As a result,
when the number of positive particles exceeds , those
particles with smaller ℒ[0· V, ] − ℒ [V, ] tend to be
negative, and vice versa; see Figure 2 for further illustration
of the proposed algorithm.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experiments</title>
      <p>To validate the performance of our proposed framework,
we conduct extensive experiments on three real-world
recommendation datasets. Through the experiments, we
seek to answer the following three questions: (i) How
does our proposed method, HAM, perform compared with
other auxiliary mask pruning (AMP) methods? (ii) How
does our proposed framework perform compared with
the existing field-wise embedding size search methods in
4.2. Baselines
We compare our proposed pruning method HAM with
several baseline methods below.
• Uniform: The method assigns a uniform embedding
size for all feature fields;
• Auxiliary Mask Pruning: We compare our proposed
method with other common approaches that apply an</p>
      <sec id="sec-2-1">
        <title>2https://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf</title>
        <p>
          DNIS [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
auxiliary mask to the embeddings. To be specific, we
select three representative types of the auxiliary mask
listed in Table 1 and apply the same one-shot NAS
framework as described in Algorithm 1:
(i) SAM: the embeddings are masked directly by
auxiliary weights 0 ≤  ≤ 1, which is equivalent to
(ii) SAM-GS: the auxiliary mask is obtained by the
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Gumbel-sigmoid function of auxiliary parameters</title>
        <p>∈ (0, 1) and  ∼</p>
        <p>Uniform(0,1) be;ow
Sigmoid[(log(
) + log(</p>
        <p>))/ ].</p>
        <p>1
−</p>
        <p>
          1
− 
(iii) HAM-p: the mask is generated by Bernoulli random
variables with parameters ’s when forwarding the
network. The gradient for ’s is calculated by STE.
In the retraining stage, we keep the embedding
components with top- auxiliary weights for fair
comparisons.
• AutoDim [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]: the state-of-the-art NAS-based method
in the literature, which outperforms a list of search
methods [
          <xref ref-type="bibr" rid="ref5 ref7 ref8">8, 10, 5, 7</xref>
          ]; see Section 3.4 in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
4.3. Implementation Details
        </p>
        <sec id="sec-2-2-1">
          <title>Base Model architecture. We adopt four representa</title>
          <p>tive CTR-prediction models in the literature: FM [26],
DeepFM [27], AutoInt [28], DCN-V2 [31], as our base
models to compare their performance. For FM, DeepFM,
and AutoInt, we add an extra embedding vector layer
between the feature interaction layer and the embedding
layer as proposed in [10]. Linear transformations
(without bias) are applied to mixed-dimension embeddings so
that the transformed embedding vectors are of the same
size, which is set as 16 in our experiments. This is not
required for DCN-V2 due to the bit-wise crossing
mechanism. In the pretraining stage, the base dimension  for
each feature field  is chosen as min(16,  ) where 
is the field dimension.
erature to employ Adam optimizer with the learning rate
of 10− 3 to optimize model parameters in all three stages
and use SGD optimizer to optimize auxiliary weights in
the search stage. To fairly compare diferent
ods, the number of search epochs is fixed as 10, and the
learning rates of auxiliary parameters are chosen from
the search grid {1, 10− 1, 10− 2, 10− 3}, and the
temperature of the sigmoid function is chosen from the search
grid {1, 10− 1, 10− 2}. To obtain the best results of each
method, we pick 10− 2 as the learning rate of SGD for
SAM, SAM-p, and HAM-p, 10− 3 for our method HAM. In
HAM, the initial value of auxiliary weights  = 0.01, and
the hyperparameter  = 5
× 10− 5. For the orthogonal
regularity, we employ  ℛ(V) with  = 10− 3 for
training models on MovieLens-1M, and use the pair-wise
cosine similarities  ℛ( V¯) with 
use PyTorch to implement our method and train it with
mini-batch size 2048 on a single 16G-Memory Nvidia
= 10− 6 on Avazu. We</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>Tesla V100.</title>
        <p>4.4. Performance Comparison</p>
      </sec>
      <sec id="sec-2-4">
        <title>We next present detailed comparisons with other meth</title>
        <p>ods. We adopt AUC (Area Under the ROC Curve) on test
datasets to measure the performance of models and the
number of embedding parameters to measure memory
usage.</p>
        <sec id="sec-2-4-1">
          <title>Comparison with other AMP Methods. We first</title>
          <p>compare our proposed AMP methods to other common
AMP methods listed in Table 1 on MovieLens-1M. To
fairly compare the performance, we not only consider
comparing the test AUC under the same target total
embedding size  but also take the number of model
parameters into account. In Figure 3, we report the
average values of test AUC under the same total embedding
size and the best results among 10 independent runs in
terms of AUC and plot Test AUC - # Parameter curve on</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>MovieLens-1M. For each method, three measurements</title>
        <p>are provided from left to right with the target total
embedding size  = 14, 28, 42 respectively. We drop the
Test AUC - # Parameter curve of the method with the
worst performance at the bottom. We observe that: (a)
ifxing the target total embedding size, we can tell that
our method HAM outperforms all other methods on four
base models with regard to the value of test AUC. As
illustrated in Test AUC - # Parameter curves, with the same
budget of total embedding size, HAM tends to assign larger
sizes to informative feature fields with larger vocabulary
sizes compared with other methods. This should be
attributed to the gradient dynamics described in Section 3.4
which helps to identify the important embedding
components. As a result, HAM obtains models with higher AUC
under the same total embedding size; (b) it is
interesting to observe that SAM also assigns larger embedding
other methods with the same total embedding size, which
verify the findings in [ 12] that the magnitudes of
auxiliary weights may not indicate the importance; (c) HAM
ture crossing models (FM, DeepFM, AutoInt), but also
on the bit-wise feature crossing model (DCN-V2) where</p>
      </sec>
      <sec id="sec-2-6">
        <title>Uniform serves as a strong baseline. The performance</title>
        <p>of HAM should be credited to the hard selection caused
by indicator functions. With the deterministic binary
mask, the masked model evaluated in the search stage is
equivalent to the selected model in the retraining stage,
as illustrated in Table 1. As a result, we conclude that</p>
      </sec>
      <sec id="sec-2-7">
        <title>HAM exhibits stable performance on all four base models</title>
        <p>AMP meth- exhibits its superiority, not only on the vector-wise
feaOptimization. We follow the common setup in the lit- sizes to relatively un-informative features compared to
DeepFM</p>
        <p>AutoInt</p>
        <p>s=28
Total Embedding Size</p>
        <p>s=42
DeepFM
s=14</p>
        <p>s=28
Total Embedding Size</p>
        <p>s=42
AutoInt
s=14</p>
        <p>s=28
Total Embedding Size</p>
        <p>s=42
DCN-V2</p>
        <p>Uniform
SAM
SAM-GS
HAM-p</p>
        <p>HAM</p>
        <p>s=28
Total Embedding Size</p>
        <p>s=42</p>
        <p>Comparison with AutoDim. We also compare our 0.0092 0.042 0.077 0.14 0.072 0.016 0.26
method with state-of-art NAS-based method AutoDim
with {2, 4, 8} as the candidate embedding size on Figure 5: Cosine similarities between column vectors from
Criteo and Avazu datasets in Figure 4. In our method, the embedding table of genre when pretraining DCN-V2 on
the total embedding size  is set to be 90, 50 for Criteo and MovieLens-1M with and without SO.
Avazu respectively. Our method outperforms AutoDim
by finding models with comparably higher AUC and
smaller sizes. In addition, from computational perspec- Table 2
tives, our approach HAM has several advantages over The test AUC gain by Algorithm 1 with SO under diferent
AutoDim: (a) HAM has a larger search space due to the target embedding sizes on MovieLens-1M and Avazu
structural masking mechanism. For each feature, its
candidate embedding size range from 0 to its base dimension Data Size  FM DeeTpeFsMtAUAuCtoInt
DCN . On the contrary, AutoDim requires manually pre- V2
specifying a set of candidate embedding sizes for each 14 +.0006 +.0029 +.0013 +.0012
feemaptuhraes;iz(eb)thHaAtMAuistomDoirme icnotmropduutcaetsioenxtarlalyspeaficcieenatn. dWtieme ML-1M 4228 ++..00003168 ++..00003334 ++..00000198 ++..00003277
tcioomn,palenxditaygbgyretghaetioopnerfoatrioeancshofcalinftdiindga,tbeasticzhe nwohrimleaHliAzaM- Avazu 6464 ++..00003394 ++..00004349 ++..00002206 ++..00003331
only requires extra element-wise product between the The results are obtained by averaging 10 runs and the bold font
binary mask and embedding vectors. Moreover, HAM can indicates statistically significant under two-sided t-tests
output multiple models of diferent total embedding sizes ( &lt; 0.05).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusions</title>
      <p>On the Orthogonal Regularity. We further analyze
the efect of the orthogonal regularity proposed in our
framework. In Figure 5, we visualize the cosine simi- In this paper, we propose a general auxiliary masking
larities between column vectors from the embedding of pruning framework to search the embedding sizes for
difgenre when pretraining DCN-V2 on MovieLens-1M with ferent feature fields adaptively in today’s recommender
and without SO where the embedding size is set as 8. It systems. The proposed method leverages the indicator
is clear to see that SO helps to reduce the correlations function to search candidate models exactly, which is
within those embedding column vectors. Moreover, we eficient and can be easily applied to various models.
compare the performance of models selected by our pro- Extensive experiments demonstrate it can efectively
reposed Algorithm 1 with the one without SO in Table 2. move redundant embedding dimensions without great
Statistical significant gains in test AUC can be observed performance loss.
for MovieLens-1M and Avazu on all base models while
the training time per epoch with SO only increases by Acknowledgments
about 1.6 times compared to the time without SO.
4.5. Discussions and Future Work</p>
      <sec id="sec-3-1">
        <title>Multi-stage Search. As the initial base dimension</title>
        <p>
          for each feature fields it tunable, we observe a gain,
∼ 0.3%, in test AUC from preliminary experiments on
MovieLens-1M by searching via HAM with a smaller
initial size = 8. This observation confirms the claim made
in recent works [
          <xref ref-type="bibr" rid="ref12">38</xref>
          ] that enlarging search space is
unbeneficial or even detrimental to existing NAS methods.
To address this issue, our method can also be adapted to
design a multi-stage search strategy with a suficiently
larger initial size and stage-wisely prune embeddings
with decreasing target embedding size .
        </p>
        <p>Comparison with PEP. To justify the advantages
of our proposed framework, we also discuss the most
recent Plug-in Embedding Pruning (PEP) [14] method.
As introduced before, PEP is an unstructured pruning
method that retains a large number of zero parameters
and requires the technique of sparse matrix storage. On
the contrary, our method can not only prune the
embedding columns explicitly but also reduce the
parameters in the dense layers while PEP cannot. For example,
HAM ( = 28) can obtain a lighter DCN-V2 model on
MovienLens-1M with only ∼ 8% number of parameters
(3,281/40,241) in the dense layers comparing to the model
with a uniform embedding size 16.</p>
        <p>
          Comparison with SSEDS. During the revisions of
this paper, we noticed that a concurrent work [
          <xref ref-type="bibr" rid="ref13">39</xref>
          ] also
proposes to prune embeddings column-wisely as well
as row-wisely using indicator functions. Our work is
diferent from theirs in two aspects: (i) we focus on
fieldwise search by pruning nearly orthogonal embedding
column vectors; (ii) we use a gradient-descent-based method
in the search stage to solve the bilevel problem on the
training and validation set; the gradient dynamics enable
us to re-activate (un-mask) masked embeddings while
SSEDS directly mask all components with small gradient
magnitudes. Comparing our method with SSEDS and
incorporating our method into SSEDS is our future work.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>The authors are grateful to anonymous reviewers for</title>
        <p>their constructive comments that greatly improved the
presentation of this paper. T. Xiao thanks his Ph.D.
advisor Dr. Krishnakumar Balasubramanian for supports and
helpful discussions during the internship at ByteDance.
in: International Conference on Machine Learning, or propagating gradients through stochastic
neuPMLR, 2020, pp. 1617–1626. rons for conditional computation, arXiv preprint
[9] W.-C. Kang, D. Z. Cheng, T. Chen, X. Yi, D. Lin, arXiv:1308.3432 (2013).</p>
        <p>L. Hong, E. H. Chi, Learning multi-granular quan- [23] E. Jang, S. Gu, B. Poole, Categorical
reparameteritized embeddings for large-vocab categorical fea- zation with gumbel-softmax, in: International
Contures in recommender systems, in: Companion ference on Learning Representations, 2016.
Proceedings of the Web Conference 2020, 2020, pp. [24] S. Srinivas, A. Subramanya, R. Venkatesh Babu,
562–566. Training sparse neural networks, in: Proceedings
[10] A. A. Ginart, M. Naumov, D. Mudigere, J. Yang, of the IEEE conference on computer vision and
patJ. Zou, Mixed dimension embeddings with applica- tern recognition workshops, 2017, pp. 138–145.
tion to memory-eficient recommendation systems, [25] M. Ye, D. Choudhary, J. Yu, E. Wen, Z. Chen, J. Yang,
in: 2021 IEEE International Symposium on Infor- J. Park, Q. Liu, A. Kejariwal, Adaptive
dense-tomation Theory (ISIT), IEEE, 2021, pp. 2786–2791. sparse paradigm for pruning online
recommen[11] H. Liu, K. Simonyan, Y. Yang, Darts: Diferentiable dation system with non-stationary data, arXiv
architecture search, in: International Conference preprint arXiv:2010.08655 (2020).</p>
        <p>on Learning Representations, 2018. [26] S. Rendle, Factorization machines, in: 2010 IEEE
In[12] R. Wang, M. Cheng, X. Chen, X. Tang, C.-J. Hsieh, ternational conference on data mining, IEEE, 2010,
Rethinking architecture selection in diferentiable pp. 995–1000.</p>
        <p>nas, arXiv preprint arXiv:2108.04392 (2021). [27] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, Deepfm: a
[13] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, factorization-machine based neural network for ctr
Q. Le, Understanding and simplifying one-shot prediction, in: Proceedings of the 26th International
architecture search, in: International Conference Joint Conference on Artificial Intelligence, 2017, pp.
on Machine Learning, PMLR, 2018, pp. 550–559. 1725–1731.
[14] S. Liu, C. Gao, Y. Chen, D. Jin, Y. Li, Learnable em- [28] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang,
bedding sizes for recommender systems, in: Inter- J. Tang, Autoint: Automatic feature interaction
national Conference on Learning Representations, learning via self-attentive neural networks, in:
Pro2020. ceedings of the 28th ACM International Conference
[15] D. Blalock, J. J. G. Ortiz, J. Frankle, J. Guttag, What is on Information and Knowledge Management, 2019,
the state of neural network pruning?, arXiv preprint pp. 1161–1170.</p>
        <p>arXiv:2003.03033 (2020). [29] R. Wang, B. Fu, G. Fu, M. Wang, Deep &amp; cross
[16] P. Savarese, H. Silva, M. Maire, Winning the lottery network for ad click predictions, in: Proceedings
with continuous sparsification, in: Advances in of the ADKDD’17, 2017, pp. 1–7.</p>
        <p>Neural Information Processing Systems, volume 33, [30] X. He, T.-S. Chua, Neural factorization machines for
Curran Associates, Inc., 2020, pp. 11380–11390. sparse predictive analytics, in: Proceedings of the
[17] C. Louizos, M. Welling, D. P. Kingma, Learning 40th International ACM SIGIR conference on
Resparse neural networks through l_0 regularization, search and Development in Information Retrieval,
in: International Conference on Learning Represen- 2017, pp. 355–364.</p>
        <p>tations, 2018. [31] R. Wang, R. Shivanna, D. Cheng, S. Jain, D. Lin,
[18] S. Xie, H. Zheng, C. Liu, L. Lin, Snas: stochastic L. Hong, E. Chi, Dcn v2: Improved deep &amp; cross
neural architecture search, in: International Con- network and practical lessons for web-scale
learnference on Learning Representations, 2018. ing to rank systems, in: Proceedings of the Web
[19] Y. Guo, A. Yao, Y. Chen, Dynamic network surgery Conference 2021, 2021, pp. 1785–1797.
for eficient dnns, in: Proceedings of the 30th In- [32] N. Bansal, X. Chen, Z. Wang, Can we gain more
ternational Conference on Neural Information Pro- from orthogonality regularizations in training deep
cessing Systems, 2016, pp. 1387–1395. networks?, Advances in Neural Information
Pro[20] X. Xiao, Z. Wang, Autoprune: Automatic network cessing Systems 31 (2018) 4261–4271.
pruning by regularizing auxiliary parameters, Ad- [33] P. Rodríguez, J. Gonzalez, G. Cucurull, J. M. Gonfaus,
vances in Neural Information Processing Systems X. Roca, Regularizing cnns with locally constrained
32 (NeurIPS 2019) 32 (2019). decorrelations, arXiv preprint arXiv:1611.01967
[21] X. Zhou, W. Zhang, H. Xu, T. Zhang, Efective spar- (2016).</p>
        <p>sification of neural networks with global sparsity [34] S. Yan, Y. Zheng, W. Ao, X. Zeng, M. Zhang, Does
constraint, in: Proceedings of the IEEE/CVF Confer- unsupervised architecture representation learning
ence on Computer Vision and Pattern Recognition, help neural architecture search?, Advances in
Neu2021, pp. 3599–3608. ral Information Processing Systems 33 (2020) 12486–
[22] Y. Bengio, N. Léonard, A. Courville, Estimating 12498.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Diferentiable neural input search for recommender systems</article-title>
          , arXiv preprint arXiv:
          <year>2006</year>
          .
          <volume>04466</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <article-title>Memory-eficient embedding for recommendations</article-title>
          , arXiv preprint arXiv:
          <year>2006</year>
          .
          <volume>14827</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <article-title>Deep learning based recommender system: A survey and new perspectives, ACM Computing Surveys (CSUR) 52 (</article-title>
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Covington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Adams</surname>
          </string-name>
          , E. Sargin,
          <article-title>Deep neural networks for youtube recommendations</article-title>
          ,
          <source>in: Proceedings of the 10th ACM conference on recommender systems</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , Autoemb:
          <article-title>Automated embedding dimensionality search in streaming recommendations</article-title>
          , arXiv preprint arXiv:
          <year>2002</year>
          .
          <volume>11252</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Automated embedding size search in deep recommender systems</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2307</fpage>
          -
          <lpage>2316</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Joglekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Khaitan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Neural input search for large scale recommendation models</article-title>
          ,
          <source>in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2387</fpage>
          -
          <lpage>2397</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Diferentiable product quantization for end-to-end embedding compression,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>H.</given-names>
            <surname>Hazimeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sathiamoorthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mazumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <article-title>Dselect-k: Diferentiable selection in the mixture of experts with applications to multi-task learning</article-title>
          ,
          <source>arXiv preprint arXiv:2106.03760</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>P.</given-names>
            <surname>Molchanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tyree</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Karras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Aila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kautz</surname>
          </string-name>
          ,
          <article-title>Pruning convolutional neural networks for resource eficient inference</article-title>
          ,
          <source>arXiv preprint arXiv:1611.06440</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>P.</given-names>
            <surname>Molchanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mallya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tyree</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Frosio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kautz</surname>
          </string-name>
          ,
          <article-title>Importance estimation for neural network pruning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>11264</fpage>
          -
          <lpage>11272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Ouyang,
          <article-title>Evolving search space for neural architecture search</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>6659</fpage>
          -
          <lpage>6669</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>L.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Singleshot embedding dimension search in recommender system</article-title>
          ,
          <source>arXiv preprint arXiv:2204.03281</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>