Field-wise Embedding Size Search via Structural Hard
Auxiliary Mask Pruning for Click-Through Rate Prediction
Tesi XiaoΒΉ, Xia XiaoΒ², Ming ChenΒ² and Youlong ChengΒ²,*

ΒΉ University of California, Davis, USA
Β² ByteDance, Mountain View, USA


Abstract
Feature embedding is one of the most important steps when training deep learning based Click-Through Rate (CTR) prediction models: it maps high-dimensional sparse features to dense embedding vectors. Classic human-crafted embedding size selection methods are shown to be β€œsub-optimal” in terms of the trade-off between memory usage and model capacity. Trending methods in Neural Architecture Search (NAS) have demonstrated their efficiency in searching for embedding sizes. However, most existing NAS-based approaches suffer from expensive computational costs, the curse of dimensionality of the search space, and the discrepancy between the continuous search space and the discrete candidate space. Other methods that prune embeddings in an unstructured manner fail to explicitly reduce the computational costs. In this paper, to address those limitations, we propose a novel strategy that searches for the optimal mixed-dimension embedding scheme by structurally pruning a super-net via a Hard Auxiliary Mask. Our method aims to directly search candidate models in the discrete space using a simple and efficient gradient-based method. Furthermore, we introduce orthogonal regularity on embedding tables to reduce correlations within embedding columns and enhance representation capacity. Extensive experiments demonstrate that it can effectively remove redundant embedding dimensions without great performance loss.

Keywords
CTR Prediction, Embedding Size, Network Pruning, Neural Architecture Search



DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
* Corresponding author.
texiao@ucdavis.edu (T. Xiao); x.xiaxiao@bytedance.com (X. Xiao); ming.chen@bytedance.com (M. Chen); youlong.chen@bytedance.com (Y. Cheng)
Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Deep learning based recommender systems (DLRS) have demonstrated superior performance over more traditional recommendation techniques [3]. The success of DLRS is mainly attributed to their ability to learn meaningful representations from categorical features, which subsequently help model the non-linear user-item relationships efficiently. Indeed, real-world recommendation tasks usually involve a large number of categorical feature fields with high cardinality (i.e., the number of unique values or vocabulary size) [4]. One-Hot Encoding is a standard way to represent such categorical features. To reduce the memory cost of One-Hot Encoding, DLRS first map the high-dimensional one-hot vectors into real-valued dense vectors via the embedding layer. Such embedded vectors are subsequently used in predictive models to obtain the required recommendations.

In this pipeline, the choice of the dimension of the embedding vectors, also known as the embedding dimension, plays a crucial role in the overall performance of the DLRS. Most existing models assign a fixed and uniform embedding dimension to all features, either due to the prerequisites of the model input or simply for the sake of convenience. If the embedding dimensions are uniformly high, memory usage and computational cost increase, because a uniform size fails to handle the heterogeneity among different features. As a concrete example, encoding features with few unique values, such as gender, with large embedding vectors leads to over-parametrization. Conversely, the selected embedding size may be insufficient for highly predictive features with large cardinalities, such as the user's last search query. Therefore, finding appropriate embedding dimensions for different feature fields is essential.

The existing works towards automating embedding size search can be categorized into two groups: (i) field-wise search [5, 2, 6]; (ii) vocabulary-wise search [7, 8, 9, 10]. The former group aims to assign different embedding sizes to different feature fields, while embeddings in the same feature field share the same dimension. The latter further attempts to assign different embedding sizes to different feature values within the same feature field, generally based on the frequencies of feature values. Although it has been shown that the latter group can significantly reduce the model size without a great performance drop, it suffers from several challenges and drawbacks (we refer readers to Section 3.4 in [2] for details): (a) the large number of unique values in each feature field leads to a huge search space in which the optimal solution is difficult to find; (b) the feature value frequencies are time-varying and not pre-known in real-time recommender systems; (c) it is difficult to handle embeddings with different dimensions within the same feature field in state-of-the-art DLRS due to the feature crossing mechanism. As a result, we fix our attention to field-wise embedding size search in this work.

[Figure 1: left panel shows a Hard Auxiliary Mask applied by βŠ™ to embedding vectors looked up from embedding tables V_1, . . . , V_K; right panel shows DNIS (soft auxiliary mask followed by magnitude-based pruning) and AutoDim (embedding lookup with memory sharing, linear transform / zero padding, batch norm + weighted sum via Gumbel-softmax).]
Figure 1: (left) The framework of our method. The operator βŠ™ denotes the element-wise product. The gray embedding components are masked and thus can be removed directly. (right) The basic ideas of DNIS [1] and AutoDim [2].

The majority of recent works towards automating embedding size search are based on the trending ideas in Neural Architecture Search (NAS). NAS has been an active research area recently; it is composed of three elements: (i) a search space π’œ of all candidate models; (ii) a search strategy that goes through models in π’œ; (iii) a performance estimation strategy that evaluates the performance of the selected model. To leverage NAS techniques to search embedding sizes for each feature field, Joglekar et al. [7] formulate the search space by discretizing an embedding matrix into several sub-matrices and take a Reinforcement Learning (RL) approach with the Trainer-Controller framework for search and performance estimation, in which the controller samples different embedding blocks to maximize the reward observed from the trainer that trains the model with the controller's choices. Liu et al. [6] also adopt RL-based search, which starts with small embedding sizes and takes actions to keep or enlarge the size based on the prediction error. Their proposed method requires sequentially training the DLRS and the policy network repeatedly until convergence, which is computationally expensive.

Given the rise of one-shot NAS, which aims to search and evaluate candidate models together in an over-parametrized supernet, Zhao et al. [5, 2] follow the idea of Differentiable Architecture Search (DARTS) [11] and solve a continuous bilevel optimization with a gradient-based algorithm. To be specific, they define a search space that includes all possible embedding sizes of each feature field; for instance, in [2], 5 candidate sizes {2, 8, 16, 24, 32} are selected for each feature field. The embedding vectors of different sizes are then lifted into the same dimension with batch normalization so that they can be aggregated with the architecture weights of the candidate sizes and fed into the subsequent layers to obtain the prediction. Cheng et al. [1] also leverage the idea of DARTS, but with the architecture weights being the weights of sub-blocks of the embedding matrices.

Despite the obtained promising results, the existing methods have certain limitations which arise from the discrepancy between the large discrete candidate space and its continuous relaxation. On the one hand, the DARTS-based methods in [5, 2, 1] are algorithmically efficient but search in an augmented continuous space; it has been shown that this discrepancy may lead to unsatisfactory candidates, because the magnitude of the architecture weights does not necessarily indicate how much an operation contributes to the overall performance [12]. On the other hand, while the RL-based methods in [7, 6] and the hard selection variant in [5] directly search and evaluate models in the original discrete candidate space, they are computationally heavy and thus not favored by large-scale DLRS. Therefore, a natural question follows:

Is it possible to come up with an efficient method that searches straight through the discrete candidate space of all possible embedding sizes?

In this work, we provide a positive answer by proposing an end-to-end field-aware embedding size search method leveraging the Straight-Through Estimator (STE) of gradients for non-differentiable functions. Motivated by one-shot NAS and network pruning, our method seeks to adaptively search for a good sub-network in a pre-trained supernet by masking redundant dimensions. The mask, named Hard Auxiliary Mask (HAM), is realized by the indicator function of auxiliary weights. Furthermore, to reduce the information redundancy and boost the search performance, we impose orthogonal regularity on embedding tables and train models with regularized loss functions.
The contributions and novelty of our work are: (i) we propose to prune embedding tables column-wise via hard auxiliary masks for CTR prediction models, which can effectively compress the model; (ii) we offer a gradient-based pruning method and delve into its dynamics with mathematical insights; (iii) we introduce orthogonal regularity for training recommendation models and experimentally show that orthogonal regularity helps to achieve significant improvement; (iv) we obtain state-of-the-art results on various modern models, and our method is scalable to large datasets.

2. Related Work

Embedding Size Search for Recommender Systems. To deal with the heterogeneity among different features and reduce the memory cost of sizable embedding tables, several recent works introduce a new paradigm of mixed-dimension embedding tables. In particular, Ginart et al. [10] propose to substitute the uniform-dimension embeddings V ∈ R^{CΓ—d} with V = EP, where E ∈ R^{CΓ—s} is a smaller embedding table with s β‰ͺ d and P ∈ R^{sΓ—d} is a projection that lifts the embeddings into the base dimension d for feature crossing purposes. A popularity-based rule is further designed to find the dimension s for each feature field, which is too rough and cannot handle important features with low frequency. In addition, plenty of works seek to provide an end-to-end framework leveraging the advances in NAS, including RL approaches [7, 6] and differentiable NAS methods [5, 13, 2, 1]. The most relevant work to ours is [1], which employs a soft auxiliary mask to prune the embeddings structurally. However, it requires magnitude-based pruning to derive the fine-grained mixed-dimension embedding, and the discrepancy between the continuously relaxed search space and the discrete candidate space could lead to a relatively great loss in performance. The most recent work [14] proposes a Plug-in Embedding Pruning (PEP) approach to prune embedding tables into sparse matrices to reduce storage costs. Albeit significantly reducing the number of non-zero embedding parameters, this type of unstructured pruning fails to explicitly reduce the embedding size.

Neural Network Pruning via Auxiliary Masks. To deploy deep learning models on resource-limited devices, there is no lack of work on neural network pruning; see [15] and the references therein. One popular approach is to learn a pruning mask through auxiliary parameters, which treats pruning as an optimization problem that minimizes the supervised loss of the masked neural network under a certain sparsity constraint. As learning the optimal mask is indeed a discrete optimization problem over binary inputs, existing works attempt to cast it into a differentiable problem and provide gradient-based search algorithms. A straightforward method is to directly replace the binary values by smooth functions of auxiliary parameters during the forward pass, such as sigmoid [16], Gumbel-sigmoid [17], softmax [11], Gumbel-softmax [18], and piece-wise linear [19]. However, all these methods, which we name Soft Auxiliary Mask (SAM), suffer from the discrepancy caused by continuous relaxation. Other methods, which we refer to as Hard Auxiliary Mask (HAM), preserve binary values in the forward pass with indicator functions [20] or Bernoulli random variables [21] and optimize parameters using the straight-through estimator (STE) [22, 23]. A detailed comparison of four representative auxiliary masking methods for approximating binary values is provided in Table 1.

3. Methodology

3.1. Preliminaries

Here we briefly introduce the mechanism of DLRS and define the terms used throughout the paper.

Model Architecture. Consider a data input that involves K feature fields from users, items, their interactions, and contextual information. We denote these raw features by multiple one-hot vectors x_1 ∈ R^{C_1}, . . . , x_K ∈ R^{C_K}, where the field dimensions C_1, . . . , C_K are the cardinalities of the feature fieldsΒΉ. Architectures of DLRS often consist of three key components: (i) embedding layers with tables V_1 ∈ R^{C_1Γ—d_1}, . . . , V_K ∈ R^{C_KΓ—d_K} that map sparse one-hot vectors to dense vectors in a low-dimensional embedding space by v_i = V_i^⊀ x_i; (ii) feature interaction layers that model complex feature crossing using the embedding vectors; (iii) output layers that make final predictions for specific recommendation tasks.

The feature crossing techniques of existing models fall into two types: vector-wise and bit-wise. Models with vector-wise crossing explicitly introduce interactions by the inner product, such as Factorization Machine (FM) [26], DeepFM [27], and AutoInt [28]. Bit-wise crossing, in contrast, implicitly adds interaction terms by element-wise operations, such as the outer product in Deep Cross Network (DCN) [29] and the Hadamard product in NFM [30] and DCN-V2 [31]. In this work, we deploy our framework on the Click-Through Rate (CTR) prediction problem with four base models: FM, DeepFM, AutoInt, and DCN-V2.

Notations. Throughout this paper, we use βŠ™ for the Hadamard (element-wise) product of two vectors. The indicator function of a scalar Ξ± is defined as 1_{Ξ±>0} = 1 if Ξ± > 0 and 1_{Ξ±>0} = 0 if Ξ± ≀ 0. The indicator function of a vector Ξ± ∈ R^d is defined as 1_{Ξ±>0} = [1_{Ξ±_1>0}, . . . , 1_{Ξ±_d>0}]^⊀. The identity matrix is written as I, and the function diag(Β·) returns a diagonal matrix whose diagonal entries are given by the input. We use β€–Β·β€–_1 and β€–Β·β€–_F for the β„“_1 norm of vectors and the Frobenius norm of matrices, respectively.

ΒΉ Numeric features are converted into categorical data by binning.
Table 1
A List of Representative Auxiliary Masking Methods

  Mask                   | Forward                                                        | Backward                   | Model_eval = Model_sel
  Soft / Deterministic   | Ξ± ∈ [0, 1] [1]; Sigmoid(Ξ±), Ξ± ∈ R [16]                         | autograd                   | βœ—
  Soft / Stochastic      | Sigmoid(log Ξ± + log(u/(1βˆ’u))) [17], Ξ± ∈ R, u ∼ Uniform(0,1)    | autograd                   | βœ—
  Hard / Stochastic      | Bernoulli(p)                                                   | STE [24], Gumbel-STE [21]  | βœ—
  Hard / Deterministic   | 1_{Ξ±>0}                                                        | STE [20, 25]               | βœ“ (our approach)

Remark. Model_eval stands for the evaluated masked model; Model_sel is the final model selected by the algorithm.

3.2. Background

In general, a CTR prediction model takes the concatenation of all feature fields from a user-item pair, denoted by x = [x_1; x_2; . . . ; x_K], as the input vector. Given the embedding layer V = {V_1, V_2, . . . , V_K}, the feature interaction layers take the embedding vectors v and feed the learned hidden states into the output layer to obtain the prediction. The embedding vectors v are the dense encoding of the one-hot input x, defined as follows:

    v = [v_1; v_2; . . . ; v_K] = [V_1^⊀ x_1; V_2^⊀ x_2; . . . ; V_K^⊀ x_K] := 𝒱x,

where 𝒱 is the embedding lookup operator. The prediction score Ε· is then calculated with the model's other parameters Θ in the feature interaction layers and the output layer by Ε· = ψ(v|Θ) = ψ(𝒱x|Θ) = Ο†(x|V, Θ), where Ε· ∈ [0, 1] is the predicted probability, ψ(Β·|Θ) is the prediction function of embedding vectors, and Ο† = ψ ∘ 𝒱 is the prediction function of raw inputs. To learn the model parameters V, Θ, we aim to minimize the Log-loss on the training data, i.e.,

    min_{V,Θ} β„’_train(V, Θ) := βˆ’(1/N) Ξ£_{j=1}^{N} [ y_j log(Ε·_j) + (1 βˆ’ y_j) log(1 βˆ’ Ε·_j) ],

where N is the total number of training samples.

Hard Auxiliary Mask. As illustrated in Figure 1, we add auxiliary masks for each embedding dimension slot. Specifically, the auxiliary masks are indicator functions of auxiliary parameters Ξ± = [Ξ±_1; Ξ±_2; . . . ; Ξ±_K], where Ξ±_i ∈ R^{d_i} has the same size as the corresponding embedding vector v_i. Provided with the masks, the predicted probability score α»Ή is given as follows:

    α»Ή = ψ(v βŠ™ 1_{Ξ±>0} | Θ) = Ο†(x | αΉΌ^Ξ±, Θ) = Ο†(x | Ξ±, V, Θ),

where αΉΌ^Ξ± = {αΉΌ_1^{Ξ±_1}, . . . , αΉΌ_K^{Ξ±_K}} is the pruned embedding layer with

    αΉΌ_i^{Ξ±_i} = V_i diag(1_{Ξ±_i>0}),   i = 1, . . . , K.

We emphasize that the embedding tables αΉΌ_i^{Ξ±_i} are pruned column-wise in a structural manner, unlike PEP [14], which prunes entry-wise.

Orthogonal Regularity. Given the embedding table V_j ∈ R^{C_jΓ—d_j} for feature field j, its column vectors, denoted by V_{j,1}, . . . , V_{j,d_j} ∈ R^{C_j}, can be regarded as d_j different representations of feature j in the embedding space. The auxiliary masks above aim to mask relatively uninformative column vectors so as to reduce the model size. Nonetheless, the presence of correlation between these vectors may complicate the selection procedure. Specifically, presuming that the most predictive column V_{j,p} has been selected, it would be problematic if we greedily select the next column V_{j,q} that brings the largest loss drop when included. For instance, if V_{j,q} is not orthogonal to V_{j,p}, we have the decomposition V_{j,q} = p + p^βŠ₯, where p = cV_{j,p} is parallel to V_{j,p} and p^βŠ₯ βŠ₯ p. It is therefore difficult to determine whether the increments are attributed to the existing direction p or to the new factor p^βŠ₯. To address this issue, we follow [32] and train the embedding parameters V with Soft Orthogonal (SO) regularization:

    β„›(V) = Ξ£_{j=1}^{K} β€–V_j^⊀ V_j βˆ’ Iβ€–_FΒ² / d_jΒ²,    (1)

where the divisors d_jΒ² are introduced to handle the heterogeneous dimensionality of embedding tables. We also adopt a relaxed SO regularization in which V_j is replaced by the normalized matrix VΜ„_j with unit column vectors, which corresponds to the pair-wise cosine similarities within embedding columns [33].
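To make the two ingredients of this section concrete, the sketch below shows a field-wise embedding table pruned column-wise by a hard auxiliary mask, αΉΌ_i^{Ξ±_i} = V_i diag(1_{Ξ±_i>0}), together with the SO regularizer of Eq. (1). This is a minimal illustration under our own naming (MaskedEmbedding, so_regularizer), not the authors' code; the inline straight-through trick is equivalent to the HardMask sketch given after Table 1.

    import torch
    import torch.nn as nn


    class MaskedEmbedding(nn.Module):
        """One field-wise table V_i whose columns are pruned by a hard
        auxiliary mask, i.e. lookups return (V_i^T x_i) * 1_{alpha_i > 0}."""

        def __init__(self, cardinality: int, dim: int, eps: float = 0.01):
            super().__init__()
            self.table = nn.Embedding(cardinality, dim)
            self.alpha = nn.Parameter(torch.full((dim,), eps))  # auxiliary weights

        def forward(self, ids: torch.Tensor) -> torch.Tensor:
            # Binary forward value with an identity straight-through backward.
            mask = (self.alpha > 0).float() + self.alpha - self.alpha.detach()
            return self.table(ids) * mask  # column-wise (structural) pruning


    def so_regularizer(tables):
        """Soft Orthogonal regularity: sum_j ||V_j^T V_j - I||_F^2 / d_j^2."""
        reg = 0.0
        for v in tables:                                   # v has shape (C_j, d_j)
            d = v.shape[1]
            gram = v.t() @ v - torch.eye(d, device=v.device)
            reg = reg + (gram ** 2).sum() / d ** 2
        return reg

During pretraining, a term Ξ» Β· so_regularizer([...]) would simply be added to the mini-batch Log-loss, as described in the next subsection.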
3.3. Framework

Our proposed framework is motivated by one-shot NAS and consists of three stages: pretrain, search, and retrain.

Pretrain. As shown in [34], pre-training architecture representations improves the downstream architecture search efficiency. Therefore, in the pretraining stage, we train the base model with a large embedding layer V. The base dimension d_j for each feature field is determined by prior knowledge. In addition, the embedding dimension d_j should not exceed the field dimension C_j, to avoid column-rank deficiency. The SO regularization term (1) is added to the mini-batch training loss so that the optimizer learns near-orthogonal embeddings. The learned model parameters, denoted by V_init, Θ_init, are passed to the search stage as initialization.

Search. Provided with the pre-trained model, the goal of the search stage is to find the column-wise sparse embedding layer that preserves model accuracy, which can be formulated as:

    min_Ξ± min_{V,Θ} β„’_train(αΉΌ^Ξ±, Θ) + ΞΌ | β€–1_{Ξ±>0}β€–_1 βˆ’ s |,

where β€–1_{Ξ±>0}β€–_1 counts the number of non-zero embedding columns and s is the target number of non-zero columns. Note that instead of regularizing β€–1_{Ξ±>0}β€–_1 directly, as [16, 20, 25] do, we include the target number s to reduce the instability from batched training and the sensitivity to the choice of the hyperparameter ΞΌ. However, given that the objective above is non-differentiable at Ξ± = 0 and has a zero gradient everywhere else, traditional gradient descent methods are not applicable. To that end, we use the straight-through estimator (STE) [22], which replaces the ill-defined gradient in the chain rule with a fake gradient. Despite various smooth alternatives used in the literature, such as the sigmoid [16] and piecewise polynomials [35], we adopt the simplest identity function for backpropagation, i.e.,

    βˆ‚β„’/βˆ‚Ξ± = (βˆ‚β„’/βˆ‚1_{Ξ±>0}) (βˆ‚1_{Ξ±>0}/βˆ‚Ξ±) β‰ˆ (βˆ‚β„’/βˆ‚1_{Ξ±>0}) (βˆ‚Ξ±/βˆ‚Ξ±) = βˆ‚β„’/βˆ‚1_{Ξ±>0}.    (2)

The search stage then starts with the unmasked model, i.e., Ξ±_0 = Ξ΅ Β· 1βƒ— for some small Ξ΅ > 0. The gradient update rule for Ξ± at iteration t is given by:

    Ξ±_{t+1} = Ξ±_t βˆ’ Ξ· Β· βˆ‡_{1_{Ξ±_t>0}} β„’_batch βˆ’ ΞΌ Β· sign(β€–1_{Ξ±_t>0}β€–_1 βˆ’ s) Β· 1βƒ—,    (3)

where Ξ· is the learning rate. We will illustrate in Section 3.4 that the above updates enjoy certain benefits and that the last term plays an important role by pushing the optimizer to iteratively evaluate candidate models with a hard auxiliary mask. Furthermore, to enhance stability and performance, we implement multi-step training by iteratively training the auxiliary parameters on validation data and retraining the original weights, which attempts to solve the following bi-level optimization problem:

    min_Ξ± β„’_val(αΉΌ^{α⋆}, Ξ˜β‹†) + ΞΌ | β€–1_{Ξ±>0}β€–_1 βˆ’ s |
    s.t. (αΉΌ^{α⋆}, Ξ˜β‹†) = arg min_{αΉΌ^Ξ±, Θ} β„’_train(αΉΌ^Ξ±, Θ).

Early Stopper for the Search Stage. We can determine to stop the search stage if the number of positive auxiliary weights is close to the target size with no significant gain in validation AUC, since the sign of the auxiliary weights exactly indicates whether the corresponding embedding components are pruned or not. On the contrary, it is hard to deploy such an early stopper with other auxiliary mask pruning methods, because the validation AUC during the search epochs is unable to represent the performance of the model selected in the retraining stage.

Retrain. After training {Ξ±, V, Θ} for several epochs in the search step, we obtain a binary mask based on the sign of Ξ± and continue optimizing {V, Θ} till convergence. The early stopper is deployed for both the pretraining and retraining steps, terminating training if the model performance on validation data cannot be improved within a few epochs. The overall framework is described in Algorithm 1.

Algorithm 1: Structural HAM Pruning
    Input: data π’Ÿ_train, π’Ÿ_val, π’Ÿ_test; base dimensions d_j (≀ C_j) for each field j; target embedding size s; hyperparameters ΞΌ, Ξ΅ > 0.
    ◁ Pretrain:
    while stopping criteria not met do
        obtain a batch of samples from π’Ÿ_train and update V, Θ regularized with SO (1) by a certain optimizer;
    end
    ◁ Search: initialize Ξ± = Ξ΅ 1βƒ—;
    while stopping criteria not met do
        obtain a batch of samples from π’Ÿ_val and update Ξ± by Eq. (3);
        obtain a batch of samples from π’Ÿ_train and update V, Θ by a certain optimizer;
    end
    ◁ Retrain: mask the dimensions with 0 where Ξ± < 0 and retrain the model until the stopping criteria are met.
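For concreteness, one alternating iteration of the search stage could be sketched as below. This is only a sketch under several assumptions: alphas collects the per-field auxiliary weight vectors (e.g. the alpha parameters of the MaskedEmbedding sketch in Section 3.2), loss_fn evaluates the Log-loss of the hard-masked model on a batch, and model_opt holds only the embedding and interaction parameters; none of these names come from the paper.

    import torch


    def search_step(alphas, model_opt, loss_fn, val_batch, train_batch,
                    eta=1e-3, mu=5e-5, target_size=42):
        """One alternating update of the search stage (cf. Eq. (3))."""
        # (1) Auxiliary weights: STE gradient on a validation batch plus the
        #     size-driven drift term  -mu * sign(||1_{alpha>0}||_1 - s).
        val_loss = loss_fn(val_batch)
        grads = torch.autograd.grad(val_loss, alphas)  # gradient w.r.t. the mask (identity STE)
        with torch.no_grad():
            n_active = sum(int((a > 0).sum()) for a in alphas)
            sgn = (n_active > target_size) - (n_active < target_size)
            for a, g in zip(alphas, grads):
                a -= eta * g + mu * sgn
        # (2) Embedding and interaction weights: ordinary optimizer step on a
        #     training batch (alphas are not in model_opt, so they stay fixed here).
        model_opt.zero_grad()
        loss_fn(train_batch).backward()
        model_opt.step()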
3.4. On the Gradient Dynamics

As finding the optimal mask is essentially a combinatorial optimization problem over the set of 2^S possible statuses of on-off switches (S is the number of switches), the proposed gradient-based search rule given in (3) provides an alternative approach to look through the discrete candidate set in a continuous space. We elaborate below that the STE-based gradient in (2) is related to the Taylor approximation of the Leave-One-Out error and that the penalty term drifts the auxiliary variables to find a model of the target size.

For illustration purposes, we start with the dynamics of the scalar-valued parameter Ξ±_{i,j}, the j-th element of Ξ±_i, that controls the mask for the j-th column vector V_{i,j} of the embedding table V_i. Define the function β„“(c) := β„’[cΒ·V_{i,j}] as the value of the loss function when replacing V_{i,j} by c Β· V_{i,j} (0 ≀ c ≀ 1). By the first-order Taylor approximation, we have β„“(1) βˆ’ β„“(0) β‰ˆ β„“β€²(0) and β„“(0) βˆ’ β„“(1) β‰ˆ βˆ’β„“β€²(1). Moreover, it is worth noting that the STE-based gradient in (2) with regard to Ξ±_{i,j} is exactly β„“β€²(1_{Ξ±_{i,j}>0}), i.e.,

    βˆ‚β„’_batch / βˆ‚1_{Ξ±_{i,j}>0} = β„“β€²(1_{Ξ±_{i,j}>0}) β‰ˆ β„“(1) βˆ’ β„“(0) = β„’[V_{i,j}] βˆ’ β„’[0Β·V_{i,j}].    (4)

In other words, the proposed gradient calculates the importance (with batches of data) of the embedding component V_{i,j} using the first-order Taylor approximation. The above heuristics are also mentioned in [36, 37, 25].

We now consider the dynamics of all the auxiliary parameters Ξ±_{i,j} provided by (3). As illustrated above, the term βˆ‡_{1_{Ξ±_t>0}} β„’_data measures the importance of each embedding component by approximating the difference between the loss value with the embedding un-masked and the loss value with the mask on. Furthermore, the sign of the penalty term ΞΌ in (3) is determined by the comparison between the number of un-masked components and the target size s. As an analogue, the dynamics of the auxiliary parameters can be viewed as a particle system in a neighborhood of 0 on the real line, in which the particles are initialized at the same point Ξ΅ > 0 (i.e., the model starts with all embeddings un-masked) and the velocity of each particle is roughly Ξ· Β· (β„’[0Β·V_{i,j}] βˆ’ β„’[V_{i,j}]) by the approximation in (4). In addition, an external force is introduced by the penalty term ΞΌ | β€–1_{Ξ±>0}β€–_1 βˆ’ s |, which can be interpreted as a wind flow with velocity βˆ’ΞΌ < 0 when the number of positive particles is larger than s, and ΞΌ > 0 otherwise. As a result, when the number of positive particles exceeds s, those particles with smaller β„’[0Β·V_{i,j}] βˆ’ β„’[V_{i,j}] tend to become negative, and vice versa; see Figure 2 for further illustration of the proposed algorithm.

[Figure 2: particles Ξ±_{i,j} near 0 with velocity β‰ˆ Ξ· Β· (β„’[0Β·V_{i,j}] βˆ’ β„’[V_{i,j}]); the external force is +ΞΌ if the number of positive Ξ±'s is below s and βˆ’ΞΌ if it is above s.]
Figure 2: An intuitive viewpoint of the gradient dynamics.

4. Experiments

To validate the performance of our proposed framework, we conduct extensive experiments on three real-world recommendation datasets. Through the experiments, we seek to answer the following three questions: (i) How does our proposed method, HAM, perform compared with other auxiliary mask pruning (AMP) methods? (ii) How does our proposed framework perform compared with the existing field-wise embedding size search methods in the literature in terms of prediction ability and memory consumption? (iii) How does the soft orthogonal regularity boost the performance of our proposed framework? In the following, we first introduce the experimental setups, including datasets, baselines, and implementation details, and then present the results as well as discussions.

4.1. Datasets and Data Preparation

We use three benchmark datasets in our experiments: (i) MovieLens-1M: This dataset contains users' ratings (1-5) on movies. We treat samples with ratings greater than 3 as positive samples and samples with ratings below 3 as negative samples; neutral samples with a rating equal to 3 are dropped. (ii) Criteo: This dataset has 45 million users' clicking records on displayed ads. It contains 26 categorical feature fields and 13 numerical feature fields. (iii) Avazu: This dataset has 40 million clicking records on displayed mobile ads. It has 22 feature fields spanning from user/device features to ad attributes. We follow the common approach to remove infrequent features and discretize numerical values. First, we remove the infrequent feature values and treat them as a single value "unknown", where the threshold is set to 10 and 4 for Criteo and Avazu, respectively. Second, we convert all numerical values into categorical values by transforming a value z to int(log(z)Β²) if int(z) > 2 and to int(z) βˆ’ 2 otherwise, as proposed by the winner of the Criteo CompetitionΒ². Third, we randomly split all training samples into 80% for training, 10% for validation, and 10% for testing.

Β² https://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
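The numerical-value discretization above can be written as a one-line rule; the helper below is only an illustration of the stated transform (handling of zeros, negatives, and per-field vocabularies is omitted).

    import math


    def discretize(z: float) -> int:
        """Map a numerical value to a categorical bucket:
        int(log(z)^2) if int(z) > 2, otherwise int(z) - 2."""
        return int(math.log(z) ** 2) if int(z) > 2 else int(z) - 2


    # e.g. discretize(1000) -> 47, discretize(2) -> 0, discretize(1) -> -1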
4.2. Baselines

We compare our proposed pruning method HAM with the several baseline methods below.

β€’ Uniform: This method assigns a uniform embedding size to all feature fields.

β€’ Auxiliary Mask Pruning: We compare our proposed method with other common approaches that apply an auxiliary mask to the embeddings. To be specific, we select three representative types of auxiliary masks listed in Table 1 and apply the same one-shot NAS framework as described in Algorithm 1:
  (i) SAM: the embeddings are masked directly by auxiliary weights 0 ≀ Ξ± ≀ 1, which is equivalent to DNIS [1].
  (ii) SAM-GS: the auxiliary mask is obtained by the Gumbel-sigmoid function of auxiliary parameters Ξ± ∈ (0, 1) with u ∼ Uniform(0,1), given below:
        Sigmoid[(log(Ξ±/(1βˆ’Ξ±)) + log(u/(1βˆ’u))) / Ξ»].
  (iii) HAM-p: the mask is generated by Bernoulli random variables with parameters p when forwarding the network. The gradient for p is calculated by STE.
  In the retraining stage, we keep the embedding components with the top-s auxiliary weights for fair comparisons.

β€’ AutoDim [2]: the state-of-the-art NAS-based method in the literature, which outperforms a list of search methods [8, 10, 5, 7]; see Section 3.4 in [2].

4.3. Implementation Details

Base Model Architecture. We adopt four representative CTR prediction models in the literature, FM [26], DeepFM [27], AutoInt [28], and DCN-V2 [31], as our base models to compare performance. For FM, DeepFM, and AutoInt, we add an extra embedding vector layer between the feature interaction layer and the embedding layer, as proposed in [10]. Linear transformations (without bias) are applied to the mixed-dimension embeddings so that the transformed embedding vectors are of the same size, which is set to 16 in our experiments. This is not required for DCN-V2 due to its bit-wise crossing mechanism. In the pretraining stage, the base dimension d_j for each feature field j is chosen as min(16, C_j), where C_j is the field dimension.

Optimization. We follow the common setup in the literature and employ the Adam optimizer with a learning rate of 10⁻³ to optimize the model parameters in all three stages, and use the SGD optimizer to optimize the auxiliary weights in the search stage. To fairly compare different AMP methods, the number of search epochs is fixed at 10, the learning rates of the auxiliary parameters are chosen from the grid {1, 10⁻¹, 10⁻², 10⁻³}, and the temperature of the sigmoid function is chosen from the grid {1, 10⁻¹, 10⁻²}. To obtain the best results for each method, we pick 10⁻² as the SGD learning rate for SAM, SAM-GS, and HAM-p, and 10⁻³ for our method HAM. In HAM, the initial value of the auxiliary weights is Ξ΅ = 0.01 and the hyperparameter ΞΌ = 5 Γ— 10⁻⁡. For the orthogonal regularity, we employ Ξ»β„›(V) with Ξ» = 10⁻³ for training models on MovieLens-1M, and use the pair-wise cosine similarities Ξ»β„›(VΜ„) with Ξ» = 10⁻⁢ on Avazu. We use PyTorch to implement our method and train it with a mini-batch size of 2048 on a single 16GB-memory Nvidia Tesla V100.
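As a rough sketch of this optimization setup (the parameter-name filter is an assumption tied to the MaskedEmbedding sketch in Section 3.2, not the authors' API):

    import torch
    import torch.nn as nn


    def build_optimizers(model: nn.Module, lr_model: float = 1e-3, lr_alpha: float = 1e-3):
        """Adam for the embedding/interaction/output parameters (all three
        stages), plain SGD for the auxiliary weights (search stage only).
        Assumes the auxiliary weights are registered under parameter names
        ending in 'alpha'; this naming convention is our assumption."""
        alpha_params = [p for n, p in model.named_parameters() if n.endswith("alpha")]
        other_params = [p for n, p in model.named_parameters() if not n.endswith("alpha")]
        model_opt = torch.optim.Adam(other_params, lr=lr_model)
        alpha_opt = torch.optim.SGD(alpha_params, lr=lr_alpha)
        return model_opt, alpha_opt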
4.4. Performance Comparison

We next present detailed comparisons with other methods. We adopt AUC (Area Under the ROC Curve) on the test datasets to measure model performance and the number of embedding parameters to measure memory usage.

Comparison with other AMP Methods. We first compare our proposed AMP method with the other common AMP methods listed in Table 1 on MovieLens-1M. To fairly compare the performance, we not only compare the test AUC under the same target total embedding size s but also take the number of model parameters into account. In Figure 3, we report the average values of test AUC under the same total embedding size and the best results among 10 independent runs in terms of AUC, and plot the Test AUC - # Parameters curve on MovieLens-1M. For each method, three measurements are provided from left to right with the target total embedding size s = 14, 28, 42, respectively. We drop the Test AUC - # Parameters curve of the method with the worst performance at the bottom. We observe that: (a) Fixing the target total embedding size, our method HAM outperforms all other methods on the four base models with regard to the value of test AUC. As illustrated in the Test AUC - # Parameters curves, with the same budget of total embedding size, HAM tends to assign larger sizes to informative feature fields with larger vocabulary sizes compared with other methods. This should be attributed to the gradient dynamics described in Section 3.4, which help to identify the important embedding components. As a result, HAM obtains models with higher AUC under the same total embedding size. (b) It is interesting to observe that SAM assigns larger embedding sizes to relatively un-informative features compared to other methods with the same total embedding size, which verifies the findings in [12] that the magnitudes of auxiliary weights may not indicate importance. (c) HAM exhibits its superiority not only on the vector-wise feature crossing models (FM, DeepFM, AutoInt), but also on the bit-wise feature crossing model (DCN-V2), where Uniform serves as a strong baseline. The performance of HAM should be credited to the hard selection induced by the indicator functions. With the deterministic binary mask, the masked model evaluated in the search stage is equivalent to the selected model in the retraining stage, as illustrated in Table 1. As a result, we conclude that HAM exhibits stable performance on all four base models, while SAM, SAM-GS, and HAM-p suffer from instability and suboptimality in some circumstances.

[Figure 3: panels for FM, DeepFM, AutoInt, and DCN-V2; Test AUC vs. total embedding size (top row) and Test AUC vs. # parameters (Γ—10⁡, bottom row) for Uniform, SAM, SAM-GS, HAM-p, and HAM.]
Figure 3: Performance comparison of different auxiliary mask pruning methods on MovieLens-1M. For each method, three measurements from left to right are reported with the target total embedding size s = 14, 28, 42 respectively.



while SAM, SAM-GS, and HAM-p suffer from instability                                                                                          given the same pre-trained model, whereas AutoDim re-
and suboptimality in some circumstances.                                                                                                      quires pretraining the model once more if changing the
                                                                                                                                              candidate embedding size.
[Figure 4: two panels (Criteo, Avazu) plotting test AUC against # Parameters (×10^6) for HAM and AutoDim.]
Figure 4: Performance comparison between HAM and AutoDim. ∘ - FM; △ - DeepFM; ⋆ - AutoInt; □ - DCN-V2.
Comparison with AutoDim. We also compare our method with the state-of-the-art NAS-based method AutoDim, with {2, 4, 8} as the candidate embedding sizes, on the Criteo and Avazu datasets in Figure 4. In our method, the total embedding size 𝑠 is set to 90 and 50 for Criteo and Avazu, respectively. Our method outperforms AutoDim by finding models with comparably higher AUC and smaller sizes. In addition, from a computational perspective, our approach HAM has several advantages over AutoDim: (a) HAM has a larger search space due to the structural masking mechanism. For each feature, its candidate embedding size ranges from 0 to its base dimension 𝑑𝑗. On the contrary, AutoDim requires manually pre-specifying a set of candidate embedding sizes for each feature; (b) HAM is more computationally efficient. We emphasize that AutoDim introduces extra space and time complexity through the operations of lifting, batch normalization, and aggregation for each candidate size, while HAM only requires an extra element-wise product between the binary mask and the embedding vectors. Moreover, HAM can output multiple models of different total embedding sizes given the same pre-trained model, whereas AutoDim requires pretraining the model once more if the candidate embedding sizes change.
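To make the masking mechanism concrete, the following minimal PyTorch-style sketch shows a hard auxiliary mask applied column-wise to field embeddings: a binary mask, obtained from an indicator on auxiliary parameters with a straight-through gradient, is multiplied element-wise into each field's embedding vectors. The class and parameter names (`HardMaskEmbedding`, `aux`, `base_dim`) are illustrative and not taken from the paper's implementation.

```python
import torch
import torch.nn as nn


class HardMaskEmbedding(nn.Module):
    """Illustrative sketch: field-wise embeddings pruned column-wise by a hard binary mask."""

    def __init__(self, vocab_sizes, base_dim):
        super().__init__()
        # One embedding table per feature field, all starting from the same base dimension.
        self.tables = nn.ModuleList(nn.Embedding(v, base_dim) for v in vocab_sizes)
        # Auxiliary parameters: one scalar per embedding column of each field.
        self.aux = nn.Parameter(torch.rand(len(vocab_sizes), base_dim))

    def forward(self, x):
        # x: LongTensor of shape (batch, num_fields)
        outs = []
        for j, table in enumerate(self.tables):
            # Hard auxiliary mask: indicator(aux > 0), i.e. exactly 0 or 1 per column.
            hard = (self.aux[j] > 0).float()
            # Straight-through estimator so gradients still reach self.aux.
            mask = hard + self.aux[j] - self.aux[j].detach()
            # The only extra cost at inference/search time is this element-wise product.
            outs.append(table(x[:, j]) * mask)
        return torch.stack(outs, dim=1)  # (batch, num_fields, base_dim)


# Toy usage with three hypothetical fields.
emb = HardMaskEmbedding(vocab_sizes=[1000, 500, 200], base_dim=16)
masked = emb(torch.randint(0, 200, (4, 3)))
print(masked.shape)  # torch.Size([4, 3, 16])
```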
[Figure 5: two heatmap panels ("Pretrain with SO" and "Pretrain without SO") of pairwise cosine similarities between embedding columns.]
Figure 5: Cosine similarities between column vectors from the embedding table of genre when pretraining DCN-V2 on MovieLens-1M with and without SO.

Table 2
The test AUC gain by Algorithm 1 with SO under different target embedding sizes on MovieLens-1M and Avazu

                           Test AUC
  Data     Size 𝑠    FM        DeepFM    AutoInt   DCN-V2
  ML-1M    14        +.0006    +.0029    +.0013    +.0012
           28        +.0018    +.0034    +.0018    +.0027
           42        +.0036    +.0033    +.0009    +.0037
  Avazu    44        +.0034    +.0039    +.0026    +.0031
           66        +.0039    +.0044    +.0020    +.0033

The results are obtained by averaging 10 runs, and the bold font indicates statistical significance under two-sided t-tests (𝑝 < 0.05).

On the Orthogonal Regularity. We further analyze the effect of the orthogonal regularity proposed in our framework. In Figure 5, we visualize the cosine similarities between column vectors from the embedding table of genre when pretraining DCN-V2 on MovieLens-1M with and without SO, where the embedding size is set to 8. It is clear that SO helps to reduce the correlations within those embedding column vectors. Moreover, we compare the performance of models selected by our proposed Algorithm 1 with the one selected without SO in Table 2. Statistically significant gains in test AUC can be observed for MovieLens-1M and Avazu on all base models, while the training time per epoch with SO only increases by about 1.6 times compared to the time without SO.
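As a rough illustration of the regularizer and of the diagnostic plotted in Figure 5, the sketch below assumes that SO denotes a soft orthogonality penalty of the form ‖EᵀE − I‖²_F on column-normalized embedding tables, in the spirit of [32]; the function names are made up for this example, and the exact form used in the paper may differ.

```python
import torch


def so_penalty(emb_table: torch.Tensor) -> torch.Tensor:
    """Soft orthogonality penalty ||E^T E - I||_F^2 for one (num_rows, dim) embedding table."""
    e = torch.nn.functional.normalize(emb_table, dim=0)  # unit-normalize each column
    gram = e.t() @ e                                      # (dim, dim) cosine-similarity matrix
    eye = torch.eye(gram.shape[0], device=gram.device)
    return ((gram - eye) ** 2).sum()


def column_cosine_similarities(emb_table: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities between embedding columns (the quantity shown in Figure 5)."""
    e = torch.nn.functional.normalize(emb_table, dim=0)
    return e.t() @ e


# Toy usage: add the penalty for every field's table to the pretraining loss.
tables = [torch.randn(100, 8, requires_grad=True)]  # a hypothetical table with embedding size 8
reg = sum(so_penalty(t) for t in tables)
print(float(reg))
```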
4.5. Discussions and Future Work

Multi-stage Search. As the initial base dimension for each feature field is tunable, we observe a gain of ∼0.3% in test AUC in preliminary experiments on MovieLens-1M when searching via HAM with a smaller initial size of 8. This observation confirms the claim made in recent works [38] that enlarging the search space is unbeneficial or even detrimental to existing NAS methods. To address this issue, our method can also be adapted to a multi-stage search strategy that starts from a sufficiently large initial size and stage-wisely prunes embeddings with a decreasing target embedding size 𝑠.
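One possible shape for such a schedule is sketched below, assuming a routine `ham_search(model, target_size)` that runs one search-and-prune pass (e.g., Algorithm 1) and returns the pruned model; both the routine name and the concrete size schedule are hypothetical.

```python
def multi_stage_ham_search(model, ham_search, size_schedule):
    """Stage-wise pruning: apply HAM repeatedly with a decreasing target total embedding size.

    `ham_search(model, target_size)` is assumed to run one search/prune pass
    and return the pruned model.
    """
    assert all(a > b for a, b in zip(size_schedule, size_schedule[1:])), \
        "target sizes should decrease stage by stage"
    for target_size in size_schedule:
        model = ham_search(model, target_size)
    return model


# Example schedule: start from a large budget and shrink toward the final target s.
# pruned = multi_stage_ham_search(model, ham_search, size_schedule=[112, 56, 28])
```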
Comparison with PEP. To justify the advantages of our proposed framework, we also discuss the most recent Plug-in Embedding Pruning (PEP) [14] method. As introduced before, PEP is an unstructured pruning method that retains a large number of zero parameters and requires the technique of sparse matrix storage. On the contrary, our method can not only prune the embedding columns explicitly but also reduce the parameters in the dense layers, while PEP cannot. For example, HAM (𝑠 = 28) can obtain a lighter DCN-V2 model on MovieLens-1M with only ∼8% of the parameters (3,281/40,241) in the dense layers compared to the model with a uniform embedding size of 16.
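The dense-layer saving arises because removing embedding columns narrows the input of the first dense layer. The toy calculation below uses hypothetical field dimensions and hidden width, not the actual DCN-V2 configuration reported above, to illustrate the effect.

```python
def first_dense_layer_params(field_dims, hidden_units):
    """Weight + bias count of the first dense layer fed by the concatenated field embeddings."""
    input_width = sum(field_dims)
    return input_width * hidden_units + hidden_units


# Hypothetical example: 7 fields with a uniform size of 16 vs. a pruned mixed-dimension scheme.
uniform = first_dense_layer_params([16] * 7, hidden_units=64)
pruned = first_dense_layer_params([8, 4, 2, 0, 6, 4, 4], hidden_units=64)
print(uniform, pruned, pruned / uniform)  # the pruned layer keeps only a fraction of the weights
```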
Comparison with SSEDS. During the revision of this paper, we noticed that a concurrent work [39] also proposes to prune embeddings column-wise as well as row-wise using indicator functions. Our work differs from theirs in two aspects: (i) we focus on field-wise search by pruning nearly orthogonal embedding column vectors; (ii) we use a gradient-descent-based method in the search stage to solve the bilevel problem on the training and validation sets; the gradient dynamics enable us to re-activate (un-mask) masked embeddings, while SSEDS directly masks all components with small gradient magnitudes. Comparing our method with SSEDS and incorporating our method into SSEDS are left for future work.
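The re-activation behaviour comes from keeping the auxiliary parameters trainable with straight-through gradients instead of permanently deleting columns; a toy sketch under the same indicator-mask assumption as the earlier snippet (the loss and learning rate are purely illustrative):

```python
import torch

# Toy auxiliary parameter for a single embedding column, currently masked (aux < 0 => mask = 0).
aux = torch.tensor([-0.05], requires_grad=True)
opt = torch.optim.SGD([aux], lr=0.1)

for step in range(5):
    hard = (aux > 0).float()
    mask = hard + aux - aux.detach()   # straight-through: forward pass uses the hard 0/1 value
    # Pretend the validation loss prefers this column to be active.
    loss = (1.0 - mask).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(step, float(aux), float((aux > 0).float()))
# Because gradients keep flowing to `aux`, it can cross zero and the column is un-masked again.
```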
5. Conclusions

In this paper, we propose a general auxiliary masking pruning framework to search the embedding sizes for different feature fields adaptively in today's recommender systems. The proposed method leverages the indicator function to search candidate models exactly, which is efficient and can be easily applied to various models. Extensive experiments demonstrate that it can effectively remove redundant embedding dimensions without great performance loss.


Acknowledgments

The authors are grateful to the anonymous reviewers for their constructive comments that greatly improved the presentation of this paper. T. Xiao thanks his Ph.D. advisor Dr. Krishnakumar Balasubramanian for support and helpful discussions during the internship at ByteDance.


References

[1] W. Cheng, Y. Shen, L. Huang, Differentiable neural input search for recommender systems, arXiv preprint arXiv:2006.04466 (2020).
[2] X. Zhao, H. Liu, H. Liu, J. Tang, W. Guo, J. Shi, S. Wang, H. Gao, B. Long, Memory-efficient embedding for recommendations, arXiv preprint arXiv:2006.14827 (2020).
[3] S. Zhang, L. Yao, A. Sun, Y. Tay, Deep learning based recommender system: A survey and new perspectives, ACM Computing Surveys (CSUR) 52 (2019) 1–38.
[4] P. Covington, J. Adams, E. Sargin, Deep neural networks for YouTube recommendations, in: Proceedings of the 10th ACM Conference on Recommender Systems, 2016, pp. 191–198.
[5] X. Zhao, C. Wang, M. Chen, X. Zheng, X. Liu, J. Tang, AutoEmb: Automated embedding dimensionality search in streaming recommendations, arXiv preprint arXiv:2002.11252 (2020).
[6] H. Liu, X. Zhao, C. Wang, X. Liu, J. Tang, Automated embedding size search in deep recommender systems, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 2307–2316.
[7] M. R. Joglekar, C. Li, M. Chen, T. Xu, X. Wang, J. K. Adams, P. Khaitan, J. Liu, Q. V. Le, Neural input search for large scale recommendation models, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 2387–2397.
[8] T. Chen, L. Li, Y. Sun, Differentiable product quantization for end-to-end embedding compression, in: International Conference on Machine Learning, PMLR, 2020, pp. 1617–1626.
[9] W.-C. Kang, D. Z. Cheng, T. Chen, X. Yi, D. Lin, L. Hong, E. H. Chi, Learning multi-granular quantized embeddings for large-vocab categorical features in recommender systems, in: Companion Proceedings of the Web Conference 2020, 2020, pp. 562–566.
[10] A. A. Ginart, M. Naumov, D. Mudigere, J. Yang, J. Zou, Mixed dimension embeddings with application to memory-efficient recommendation systems, in: 2021 IEEE International Symposium on Information Theory (ISIT), IEEE, 2021, pp. 2786–2791.
[11] H. Liu, K. Simonyan, Y. Yang, DARTS: Differentiable architecture search, in: International Conference on Learning Representations, 2018.
[12] R. Wang, M. Cheng, X. Chen, X. Tang, C.-J. Hsieh, Rethinking architecture selection in differentiable NAS, arXiv preprint arXiv:2108.04392 (2021).
[13] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, Q. Le, Understanding and simplifying one-shot architecture search, in: International Conference on Machine Learning, PMLR, 2018, pp. 550–559.
[14] S. Liu, C. Gao, Y. Chen, D. Jin, Y. Li, Learnable embedding sizes for recommender systems, in: International Conference on Learning Representations, 2020.
[15] D. Blalock, J. J. G. Ortiz, J. Frankle, J. Guttag, What is the state of neural network pruning?, arXiv preprint arXiv:2003.03033 (2020).
[16] P. Savarese, H. Silva, M. Maire, Winning the lottery with continuous sparsification, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 11380–11390.
[17] C. Louizos, M. Welling, D. P. Kingma, Learning sparse neural networks through l_0 regularization, in: International Conference on Learning Representations, 2018.
[18] S. Xie, H. Zheng, C. Liu, L. Lin, SNAS: Stochastic neural architecture search, in: International Conference on Learning Representations, 2018.
[19] Y. Guo, A. Yao, Y. Chen, Dynamic network surgery for efficient DNNs, in: Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 1387–1395.
[20] X. Xiao, Z. Wang, AutoPrune: Automatic network pruning by regularizing auxiliary parameters, Advances in Neural Information Processing Systems 32 (2019).
[21] X. Zhou, W. Zhang, H. Xu, T. Zhang, Effective sparsification of neural networks with global sparsity constraint, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3599–3608.
[22] Y. Bengio, N. Léonard, A. Courville, Estimating or propagating gradients through stochastic neurons for conditional computation, arXiv preprint arXiv:1308.3432 (2013).
[23] E. Jang, S. Gu, B. Poole, Categorical reparameterization with Gumbel-Softmax, in: International Conference on Learning Representations, 2016.
[24] S. Srinivas, A. Subramanya, R. Venkatesh Babu, Training sparse neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 138–145.
[25] M. Ye, D. Choudhary, J. Yu, E. Wen, Z. Chen, J. Yang, J. Park, Q. Liu, A. Kejariwal, Adaptive dense-to-sparse paradigm for pruning online recommendation system with non-stationary data, arXiv preprint arXiv:2010.08655 (2020).
[26] S. Rendle, Factorization machines, in: 2010 IEEE International Conference on Data Mining, IEEE, 2010, pp. 995–1000.
[27] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, DeepFM: A factorization-machine based neural network for CTR prediction, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017, pp. 1725–1731.
[28] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, J. Tang, AutoInt: Automatic feature interaction learning via self-attentive neural networks, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1161–1170.
[29] R. Wang, B. Fu, G. Fu, M. Wang, Deep & cross network for ad click predictions, in: Proceedings of the ADKDD'17, 2017, pp. 1–7.
[30] X. He, T.-S. Chua, Neural factorization machines for sparse predictive analytics, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 355–364.
[31] R. Wang, R. Shivanna, D. Cheng, S. Jain, D. Lin, L. Hong, E. Chi, DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems, in: Proceedings of the Web Conference 2021, 2021, pp. 1785–1797.
[32] N. Bansal, X. Chen, Z. Wang, Can we gain more from orthogonality regularizations in training deep networks?, Advances in Neural Information Processing Systems 31 (2018) 4261–4271.
[33] P. Rodríguez, J. Gonzalez, G. Cucurull, J. M. Gonfaus, X. Roca, Regularizing CNNs with locally constrained decorrelations, arXiv preprint arXiv:1611.01967 (2016).
[34] S. Yan, Y. Zheng, W. Ao, X. Zeng, M. Zhang, Does unsupervised architecture representation learning help neural architecture search?, Advances in Neural Information Processing Systems 33 (2020) 12486–12498.
[35] H. Hazimeh, Z. Zhao, A. Chowdhery, M. Sathiamoorthy, Y. Chen, R. Mazumder, L. Hong, E. H. Chi, DSelect-k: Differentiable selection in the mixture of experts with applications to multi-task learning, arXiv preprint arXiv:2106.03760 (2021).
[36] P. Molchanov, S. Tyree, T. Karras, T. Aila, J. Kautz, Pruning convolutional neural networks for resource efficient inference, arXiv preprint arXiv:1611.06440 (2016).
[37] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, J. Kautz, Importance estimation for neural network pruning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11264–11272.
[38] Y. Ci, C. Lin, M. Sun, B. Chen, H. Zhang, W. Ouyang, Evolving search space for neural architecture search, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6659–6669.
[39] L. Qu, Y. Ye, N. Tang, L. Zhang, Y. Shi, H. Yin, Single-shot embedding dimension search in recommender system, arXiv preprint arXiv:2204.03281 (2022).