=Paper=
{{Paper
|id=Vol-3317/paper2
|storemode=property
|title=Field-wise Embedding Size Search via Structural Hard Auxiliary Mask Pruning for Click-Through Rate Prediction
|pdfUrl=https://ceur-ws.org/Vol-3317/Paper2.pdf
|volume=Vol-3317
|authors=Tesi Xiao,Xia Xiao,Ming Chen,Youlong Cheng
|dblpUrl=https://dblp.org/rec/conf/cikm/XiaoXCC22
}}
==Field-wise Embedding Size Search via Structural Hard Auxiliary Mask Pruning for Click-Through Rate Prediction==
Field-wise Embedding Size Search via Structural Hard Auxiliary Mask Pruning for Click-Through Rate Prediction Tesi Xiao1 , Xia Xiao2 , Ming Chen2 and Youlong Cheng2,* 1 University of California, Davis, USA 2 ByteDance, Mountain View, USA Abstract Feature embeddings are one of the most important steps when training deep learning based Click-Through Rate prediction models, which map high-dimensional sparse features to dense embedding vectors. Classic human-crafted embedding size selection methods are shown to be βsub-optimal" in terms of the trade-off between memory usage and model capacity. The trending methods in Neural Architecture Search (NAS) have demonstrated their efficiency to search for embedding sizes. However, most existing NAS-based approaches suffer from expensive computational costs, the curse of dimensionality of the search space, and the discrepancy between continuous search space and discrete candidate space. Other methods that prune embeddings in an unstructured manner fail to explicitly reduce the computational costs. In this paper, to address those limitations, we propose a novel strategy that searches for the optimal mixed-dimension embedding scheme by structurally pruning a super-net via Hard Auxiliary Mask. Our method aims to directly search candidate models in the discrete space using a simple and efficient gradient-based method. Furthermore, we introduce orthogonal regularity on embedding tables to reduce correlations within embedding columns and enhance representation capacity. Extensive experiments demonstrate it can effectively remove redundant embedding dimensions without great performance loss. Keywords CTR Prediction, Embedding Size, Network Pruning, Neural Architecture Search 1. Introduction embedding dimension for all features, either due to the prerequisites of the model input or simply for the sake Deep learning based recommender systems (DLRS) have of convenience. If the embedding dimensions are uni- demonstrated superior performance over more tradi- formly high, it leads to increased memory usage and tional recommendation techniques [3]. The success of computational cost, as it fails to handle the heterogeneity DLRS is mainly attributed to their ability to learn mean- among different features. As a concrete example, encod- ingful representations with categorical features, that sub- ing features with few unique values like gender with large sequently help with modeling the non-linear user-item embedding vectors leads to over-parametrization. Con- relationships efficiently. Indeed, real-world recommenda- versely, the selected embedding size may be insufficient tion tasks usually involve a large number of categorical for highly-predictive features with large cardinalities, feature fields with high cardinality (i.e. the number of such as the userβs last search query. Therefore, finding unique values or vocabulary size) [4]. One-Hot Encoding appropriate embedding dimensions for different feature is a standard way to represent such categorical features. fields is essential. To reduce the memory cost of One-Hot Encoding, DLRS The existing works towards automating embedding first maps the high-dimensional one-hot vectors into size search can be categorized into two groups: (i) field- real-valued dense vectors via the embedding layer. Such wise search [5, 2, 6]; (ii) vocabulary-wise search [7, 8, 9, 10]. embedded vectors are subsequently used in predictive The former group aims to assign different embedding models for obtaining the required recommendations. 
sizes to different feature fields, while embeddings in the In this pipeline, the choice of the dimension of the same feature field share the same dimension. The lat- embedding vectors, also known as embedding dimension, ter further attempts to find different embedding sizes plays a crucial role in the overall performance of the to different feature values within the same feature field, DLRS. Most existing models assign a fixed and uniform which is generally based on the frequencies of feature DL4SRβ22: Workshop on Deep Learning for Search and Recommen- values. Although it has been shown that the latter group dation, co-located with the 31st ACM International Conference on can significantly reduce the model size without a great Information and Knowledge Management (CIKM), October 17-21, 2022, performance drop, the second group of works suffers Atlanta, USA from several challenges and drawbacks (we refer readers * Corresponding author. to Section 3.4 in [2] for details): (a) a large number of $ texiao@ucdavis.edu (T. Xiao); x.xiaxiao@bytedance.com (X. Xiao); ming.chen@bytedance.com (M. Chen); unique values in each feature field leads to a huge search youlong.chen@bytedance.com (Y. Cheng) space in which the optimal solution is difficult to find; Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). (b) the feature values frequencies are time-varying and CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) π¦! Our Framework DNIS [Cheng et al., 2020] Interaction Layers Magnitude-based Pruning Soft Auxiliary Mask 0.6 0.1 0.9 0.8 β -0.36 0.03 -0.18 0.96 1 0 1 1 0 1 1 0 1 1 Hard Auxiliary Mask Embedding Vector -0.6 0.3 -0.2 0.12 Output β β β Embedding 1.2 1 -0.9 -0.6 0.3 -0.2 0.12 0.1 0.3 -2.1 Vector Embedding Vector AutoDim [Zhao et al., 2020] Embedding Look-up 0.4 -0.2 -0.3 0.2 -1 0.1 -0.6 -1 0.2 -0.1 Linear Transform / Zero Padding -0.2 0.1 -0.8 -0.5 0.7 -0.3 0.2 0.1 0.3 -2.1 Gumbel-softmax Embedding Table V! V" V# Output 1.2 1 -0.9 -0.6 0.3 -0.2 0.12 1.2 -0.1 0.1 + Embedding 0.3 0.2 0.7 0.1 0.2 0.5 -0.4 0.7 0.6 -0.4 Batch Norm + Weighted Sum Vector Embedding Lookup with Memory Sharing Field 1 Field π Field πΎ Figure 1: (left) The framework of our method. The operator β denotes the element-wise product. The gray embedding components are masked and thus can be removed directly. (right) The basic ideas of DNIS [1] and AutoDim [2]. not pre-known in real-time recommender system; (c) it is into the same dimension with batch normalization so difficult to handle embeddings with different dimensions that they can be aggregated with architecture weights for the same feature field in the state of art DLRS due to of candidate sizes and fed into the subsequent layers to the feature crossing mechanism. As a result, we fix our obtain the prediction. Cheng et al. [1] also leverage the attention to the field-wise embedding size search in this idea of DARTS but with the architecture weights being work. the weights of sub-blocks of embedding matrices. The majority of recent works towards automating em- Despite the obtained promising results, the existing bedding size search are based on the trending ideas in methods have certain limitations which arise from the Neural Architecture Search (NAS). NAS has been an ac- discrepancy between the large discrete candidate space tive research area recently, which is composed of three and its continuous relaxation. 
On the one hand, the elements: (i) a search space π of all candidate models; DARTS-based methods in [5, 2, 1] are algorithmically (ii) a search strategy that goes through models in π; (iii) efficient but search in an augmented continuous space; a performance estimation strategy that evaluates the per- it has been shown that the discrepancy may lead to formance of the selected model. To leverage the NAS finding unsatisfactory candidates because the magnitude techniques to search embedding sizes for each feature of architecture weights does not necessarily indicate field, Joglekar et al. [7] formulate the search space by dis- how much the operation contributes to the overall cretizing an embedding matrix into several sub-matrices performance [12]. On the other hand, while the RL-based and take a Reinforcement Learning (RL) approach with methods in [7, 6] and the hard selection variant in [5] the Trainer-Controller framework for search and per- directly search and evaluate models in the original formance estimation, in which the controller samples discrete candidate space, they are computationally heavy different embedding blocks to maximize the reward ob- and thus not favored by the large-scale DLRS. Therefore, served from the trainer that trains the model with the a natural question follows: controllerβs choices. Liu et al. [6] also adopt the RL-based search which starts with small embedding sizes and takes Is it possible to come up with an efficient method that actions to keep or enlarge the size based on the predic- searches straight through the discrete candidate space of tion error. Their proposed method requires sequentially all possible embedding sizes? training DRLS and the policy network repeatedly until convergence, which is computationally expensive. In this work, we provide a positive answer by propos- Given the rise of one-shot NAS aiming to search ing an end-to-end field-aware embedding size search and evaluate candidate models together in an over- method leveraging the Straight Through Estimator (STE) parametrized supernet, Zhao et al. [5, 2] follow the idea of gradients for non-differentiable functions. Motivated of Differential Neural Architecture Search (DARTS)[11] by one-shot NAS and network pruning, our method seeks and solve a continuous bilevel optimization by the to adaptively search for a good sub-network in a pre- gradient-based algorithm. To be specific, they define trained supernet by masking redundant dimensions. The a search space that includes all possible embedding sizes mask, named as Hard Auxiliary Mask (HAM), is real- of each feature field; for instance in [2], 5 candidate ized by the indicator function of auxiliary weights. Fur- sizes {2, 8, 16, 24, 32} are selected for each feature field. thermore, to reduce the information redundancy and These embedding vectors in different sizes are then lifted boost the search performance, we impose the orthogonal regularity on embedding tables and train models with puts, existing works attempt to cast it into a differentiable regularized loss functions. Contributions and nov- problem and provide gradient-based search algorithms. 
A elty of our work are: (i) we propose to prune embed- straightforward method is to directly replace binary val- ding tables column-wisely via hard auxiliary masks for ues by smooth functions of auxiliary parameters during the CTR prediction models, which can effectively com- the forward pass, such as sigmoid [16], Gumbel-sigmoid press the model; (ii) we offer a gradient-based pruning [17], softmax [11], Gumbel-softmax [18], and piece-wise method and delve into its dynamics with mathematical linear [19]. However, all these methods, which we name insights; (iii) we introduce orthogonal regularity for train- Soft Auxiliary Mask (SAM), suffer from the discrepancy ing recommendation models and experimentally show caused by continuous relaxation. Other methods, which that orthogonal regularity can help to achieve significant we refer to Hard Auxiliary Mask (HAM), preserve binary improvement; (iv) we obtain state-of-the-art results on values in the forward pass with indicator functions [20] various modern models and our method is scalable on or Bernoulli random variables [21] and optimize parame- large datasets. ters using the straight-through estimator (STE) [22, 23]. A detailed comparison of four representative auxiliary masking methods for approximating binary values is pro- 2. Related Work vided in Table 1. Embedding Size Search for Recommender Systems. To deal with the heterogeneity among different features 3. Methodology and reduce the memory cost of sizable embedding tables, several recent works introduce a new paradigm of mixed- 3.1. Preliminaries dimension embedding tables. In particular, Ginart et al. [10] propose to substitute the uniform-dimension embed- Here we briefly introduce the mechanism of DLRS and dings V β RπΆΓπ with V = EP, where E β RπΆΓπ is define the terms used throughout the paper. the smaller embeddings with π βͺ π and P β Rπ Γπ is the Model Architecture. Consider the data input projection that lift the embeddings into the base dimen- involves πΎ features fields from users, items, their sion π for feature crossing purposes. A popularity-based interactions, and contextual information. We de- rule is further designed to find the dimension π for each note these raw features by multiple one-hot vectors feature field which is too rough and cannot handle impor- x1 β RπΆ1 , . . . , xπΎ β RπΆπΎ , where field dimensions tant features with low frequency. In addition, plenty of πΆ1 , . . . πΆπΎ are the cardinalities of feature fields1 . Ar- works seek to provide an end-to-end framework leverag- chitectures of DLRS often consist of three key com- ing the advances in NAS, including RL approaches [7, 6] ponents: (i) embedding layers with tables V1 β and differentiable NAS methods [5, 13, 2, 1]. The most RπΆ1 Γπ1 , . . . , VπΎ β RπΆπΎ ΓππΎ that map sparse one-hot relevant work to us is [1], which employs a soft auxiliary vectors to dense vectors in a low dimensional embedding mask to prune the embeddings structurally. However, it space by vπ = Vπβ€ xπ ; (ii) feature interaction layers that requires magnitude-based pruning to derive fine-grained model complex feature crossing using embedding vec- mixed embedding, in which the discrepancy between the tors; (iii) output layers that make final predictions for continuous relaxed search space and discrete candidate specific recommendation tasks. space could lead to a relatively great loss in performance. 
The most recent work [14] proposes a Plug-in Embedding Pruning (PEP) approach to prune embedding tables into sparse matrices to reduce storage costs. Albeit significantly reducing the number of non-zero embedding parameters, this type of unstructured pruning method fails to explicitly reduce the embedding size.

Neural Network Pruning via Auxiliary Masks. To deploy deep learning models on resource-limited devices, there is no lack of work in neural network pruning; see [15] and the references therein. One popular approach towards solving this problem is to learn a pruning mask through auxiliary parameters, which considers pruning as an optimization problem that tends to minimize the supervised loss of the masked neural network with a certain sparsity constraint. As learning the optimal mask is indeed a discrete optimization problem for binary inputs, existing works cast it into a differentiable problem and provide gradient-based search algorithms; a comparison of four representative auxiliary masking methods for approximating binary values is provided in Table 1.

Table 1: A List of Representative Auxiliary Masking Methods
- Soft, Deterministic. Forward: α ∈ [0, 1] [1]; Sigmoid(α), α ∈ R [16]. Backward: autograd. Model_eval = Model_sel: ✗
- Soft, Stochastic. Forward: Sigmoid(log α + log(u/(1 − u))), α ∈ R, u ~ Uniform(0, 1) [17]. Backward: autograd. Model_eval = Model_sel: ✗
- Hard, Stochastic. Forward: Bernoulli(p). Backward: STE [24], Gumbel-STE [21]. Model_eval = Model_sel: ✗
- Hard, Deterministic (our approach). Forward: 1_{α>0}. Backward: STE [20, 25]. Model_eval = Model_sel: ✓
Remark: Model_eval stands for the evaluated masked model; Model_sel is the final model selected by the algorithm.

The feature crossing techniques of existing models fall into two types: vector-wise and bit-wise. Models with vector-wise crossing explicitly introduce interactions by the inner product, such as Factorization Machine (FM) [26], DeepFM [27], and AutoInt [28]. The bit-wise crossing, in contrast, implicitly adds interaction terms by element-wise operations, such as the outer product in Deep & Cross Network (DCN) [29] and the Hadamard product in NFM [30] and DCN-V2 [31]. In this work, we deploy our framework to the Click-Through Rate (CTR) prediction problem with four base models: FM, DeepFM, AutoInt, and DCN-V2. (Footnote 1: Numeric features are converted into categorical data by binning.)

Notations. Throughout this paper, we use ⊙ for the Hadamard (element-wise) product of two vectors. The indicator function of a scalar-valued α is defined as 1_{α>0} = 1 if α > 0 and 1_{α>0} = 0 if α ≤ 0; the indicator function of a vector α ∈ R^d is defined as 1_{α>0} = [1_{α_1>0}, ..., 1_{α_d>0}]^T. The identity matrix is written as I, and the function diag(·) returns a diagonal matrix with its diagonal entries given by the input. We use ‖·‖_1 and ‖·‖_F for the ℓ1 norm of vectors and the Frobenius norm of matrices, respectively.

3.2. Background

In general, CTR prediction models take the concatenation of all feature fields from a user-item pair, denoted by x = [x_1; x_2; ...; x_K], as the input vector. Given the embedding layer V = {V_1, V_2, ..., V_K}, the feature interaction layers take the embedding vectors v and feed the learned hidden states into the output layer to obtain the prediction. The embedding vectors v are the dense encoding of the one-hot input x, defined as follows:

v = [v_1; v_2; ...; v_K] = [V_1^T x_1; V_2^T x_2; ...; V_K^T x_K] := 𝒱x,

where 𝒱 is the embedding lookup operator. The prediction score ŷ is then calculated with the model's other parameters Θ in the feature interaction layers and output layer by ŷ = f(v|Θ) = f(𝒱x|Θ) = F(x|V, Θ), where ŷ ∈ [0, 1] is the predicted probability, f(·|Θ) is the prediction function of embedding vectors, and F = f ∘ 𝒱 is the prediction function of raw inputs. To learn the model parameters V, Θ, we aim to minimize the Log-loss on the training data, i.e.,

min_{V,Θ} L_train(V, Θ) := −(1/N) Σ_{i=1}^{N} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ],

where N is the total number of training samples.

Hard Auxiliary Mask. As illustrated in Figure 1, we add auxiliary masks for each embedding dimension slot. Specifically, the auxiliary masks are indicator functions of auxiliary parameters α = [α_1; α_2; ...; α_K], where α_k ∈ R^{d_k} is of the same size as the corresponding embedding vector v_k. Provided with the masks, the predicted probability score ŷ is given as follows:

ŷ = f(v ⊙ 1_{α>0} | Θ) = F(x | Ṽ^α, Θ) = F(x | α, V, Θ),

where Ṽ^α = {Ṽ_1^{α_1}, ..., Ṽ_K^{α_K}} is the pruned embedding layer with Ṽ_k^{α_k} = V_k diag(1_{α_k>0}), k = 1, ..., K. We emphasize that the embedding tables Ṽ_k^{α_k} are pruned column-wise in a structural manner, unlike PEP [14] which prunes entry-wise.

Orthogonal Regularity. Given the embedding table V_k ∈ R^{C_k×d_k} for feature field k, its column vectors, denoted by V_{k,1}, ..., V_{k,d_k} ∈ R^{C_k}, can be regarded as d_k different representations of feature k in the embedding space. The auxiliary masks above aim to mask relatively uninformative column vectors so as to reduce the model size. Nonetheless, the presence of correlation between these vectors may complicate the selection procedure. Specifically, presuming that the most predictive column V_{k,i} has been selected, it would be problematic to greedily select the next column V_{k,j} that brings the largest loss drop when included. For instance, if V_{k,j} is not orthogonal to V_{k,i}, we have the decomposition V_{k,j} = p + p^⊥, where p = cV_{k,i} is parallel to V_{k,i} and p^⊥ ⊥ p; therefore, it would be difficult to determine whether the increments are attributed to the existing direction p or to the new factor p^⊥. To address this issue, we follow [32] and train the embedding parameters V with Soft Orthogonal (SO) regularization:

R(V) = Σ_{k=1}^{K} ‖V_k^T V_k − I‖_F^2 / d_k^2,   (1)

where the divisors d_k^2 are introduced to handle the heterogeneous dimensionality of embedding tables. We also adopt a relaxed SO regularization in which V_k is replaced by the normalized matrix V̄_k with unit column vectors, which corresponds to the pair-wise cosine similarities within embedding columns [33].
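To make the masking and regularization above concrete, the following is a minimal PyTorch-style sketch of a field-wise embedding layer with hard auxiliary masks and the SO penalty of Eq. (1). It is an illustration of the ideas above rather than the authors' released code; the class and method names (HAMEmbedding, so_penalty) are our own, and the hard + α − α.detach() idiom is one standard way of realizing the identity straight-through estimator used later in Eq. (2).

```python
import torch
import torch.nn as nn


class HAMEmbedding(nn.Module):
    """Field-wise embedding tables V_k with hard auxiliary masks 1_{alpha_k > 0}.

    Minimal sketch of the masking and SO regularity described above; not the
    authors' implementation, and all names are illustrative.
    """

    def __init__(self, field_dims, base_dims, eps=0.01):
        super().__init__()
        self.tables = nn.ModuleList(
            nn.Embedding(c, d) for c, d in zip(field_dims, base_dims))
        # one auxiliary weight per embedding column, initialised at eps > 0 (all un-masked)
        self.alpha = nn.ParameterList(
            nn.Parameter(torch.full((d,), eps)) for d in base_dims)

    def forward(self, x):
        # x: LongTensor of shape (batch, K) holding the categorical index of each field
        out = []
        for k, table in enumerate(self.tables):
            v_k = table(x[:, k])                 # v_k = V_k^T x_k
            hard = (self.alpha[k] > 0).float()   # 1_{alpha_k > 0}
            # straight-through: forward uses the hard mask, backward treats it as alpha
            mask = hard + self.alpha[k] - self.alpha[k].detach()
            out.append(v_k * mask)               # v_k masked column-wise
        return torch.cat(out, dim=1)

    def so_penalty(self):
        # Soft Orthogonal regularizer of Eq. (1): sum_k ||V_k^T V_k - I||_F^2 / d_k^2
        reg = 0.0
        for table in self.tables:
            V = table.weight                     # shape (C_k, d_k)
            d = V.shape[1]
            gram = V.t() @ V
            reg = reg + ((gram - torch.eye(d, device=V.device)) ** 2).sum() / d ** 2
        return reg
```

In a full model, the concatenated output would feed the feature interaction layers, and so_penalty() would be added to the training loss with a coefficient such as the ν used in the experiments.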
3.3. Framework

Our proposed framework is motivated by one-shot NAS and consists of three stages: pretrain, search, and retrain.

Pretrain. As shown in [34], pre-training architecture representations improves the downstream architecture search efficiency. Therefore, in the pretraining stage, we train the base model with a large embedding layer V. The base dimension d_k for each feature field is determined by prior knowledge. In addition, the embedding dimension d_k should not exceed the field dimension C_k, to avoid column-rank deficiency. The SO regularization term (1) is added to the mini-batch training loss for the optimizer to learn near-orthogonal embeddings. The learned model parameters, denoted by V_init, Θ_init, are passed to the search stage as initialization.

Search. Provided with the pre-trained model, the goal of the search stage is to find the column-wise sparse embedding layer that preserves model accuracy, which can be formulated as

min_α min_{V,Θ} L_train(Ṽ^α, Θ) + λ | ‖1_{α>0}‖_1 − s |,

where ‖1_{α>0}‖_1 counts the number of non-zero embedding columns and s is the target number of non-zero columns. Note that instead of direct regularization on ‖1_{α>0}‖_1 as [16, 20, 25] do, we include the target number s to reduce instability from batched training and the choice of hyperparameter λ. However, given that the objective function above is non-differentiable when α = 0 and has a zero gradient anywhere else, traditional gradient descent methods are not applicable. To that end, we use the straight-through estimator (STE) [22], which replaces the ill-defined gradient in the chain rule with a fake gradient. Despite various smooth alternatives used in the literature, such as sigmoid [16] and piecewise polynomials [35], we adopt the simplest identity function for backpropagation, i.e.,

∂L/∂α = (∂L/∂1_{α>0}) · (∂1_{α>0}/∂α) ≈ (∂L/∂1_{α>0}) · (∂α/∂α) = ∂L/∂1_{α>0}.   (2)

Then, the search stage starts with the unmasked model, that is, α_0 = ε · 1 for some small ε > 0. The gradient update rule for α at iteration t is given by

α_{t+1} = α_t − η · ∇_{1_{α_t>0}} L_batch − ηλ · sign(‖1_{α_t>0}‖_1 − s) · 1,   (3)

where η is the learning rate. We will illustrate in Section 3.4 that the above updates enjoy certain benefits and that the last term plays an important role by pushing the optimizer to iteratively evaluate candidate models with a hard auxiliary mask. Furthermore, to enhance the stability and performance, we implement a multi-step training through iteratively training the auxiliary parameters on validation data and retraining the original weights, which attempts to solve the following bi-level optimization problem:

min_α L_val(Ṽ^α*, Θ*) + λ | ‖1_{α>0}‖_1 − s |,
s.t. (Ṽ^α*, Θ*) = arg min_{Ṽ^α, Θ} L_train(Ṽ^α, Θ).

Early Stopper for the Search Stage. We can determine to stop the search stage if the number of positive auxiliary weights is close to the target size with no significant gain in validation AUC, since the sign of the auxiliary weights exactly indicates whether the corresponding embedding components are pruned or not. On the contrary, it is hard to deploy the early stopper for the search stage using other auxiliary mask pruning methods, because the value of validation AUC during the search epochs is unable to represent the performance of the model selected in the retraining stage.

Retrain. After training {α, V, Θ} for several epochs in the search step, we obtain a binary mask based on the sign of α and continue optimizing {V, Θ} till convergence. The early stopper is deployed for both the pretraining and retraining steps, which terminates training if model performance on validation data cannot be improved within a few epochs. The overall framework is described in Algorithm 1.

Algorithm 1: Structural HAM Pruning
Input: data D_train, D_val, D_test; base dimensions d_k (≤ C_k) for each field k; target embedding size s; hyperparameters λ, ε > 0.
• Pretrain:
    while stopping criteria not met do
        obtain a batch of samples from D_train and update V, Θ regularized with SO (1) by a certain optimizer;
    end
• Search: initialize α = ε · 1;
    while stopping criteria not met do
        obtain a batch of samples from D_val and update α by Eq. (3);
        obtain a batch of samples from D_train and update V, Θ by a certain optimizer;
    end
• Retrain: mask the dimensions with 0 where α < 0 and retrain the model until the stopping criteria are met.
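As a schematic of one search iteration (the α update of Eq. (3) followed by a model update, cf. Algorithm 1), consider the sketch below. It assumes the hypothetical HAMEmbedding module from the previous sketch, a model that calls it internally, and standard PyTorch training objects; it is not the authors' implementation, and the function name search_step and its argument layout are illustrative.

```python
import torch


def search_step(model, emb, val_batch, train_batch, criterion,
                model_opt, target_s, eta=1e-3, lam=5e-5):
    """One search iteration: update alpha on validation data (Eq. (3)),
    then update V and Theta on training data. Sketch only.
    `model` is assumed to use `emb` internally; `model_opt` should cover
    V and Theta but not emb.alpha."""
    # --- alpha update on a validation batch (Eq. (3)) ---
    x_val, y_val = val_batch
    loss = criterion(model(x_val), y_val)
    grads = torch.autograd.grad(loss, list(emb.alpha))  # STE gradient w.r.t. alpha (Eq. (2))
    with torch.no_grad():
        n_active = sum(int((a > 0).sum()) for a in emb.alpha)
        drift = eta * lam * ((n_active > target_s) - (n_active < target_s))
        for a, g in zip(emb.alpha, grads):
            a -= eta * g + drift                        # gradient step + sign penalty

    # --- model update (V, Theta) on a training batch ---
    x_tr, y_tr = train_batch
    model_opt.zero_grad()
    criterion(model(x_tr), y_tr).backward()
    model_opt.step()
```

The paper optimizes α with plain SGD; writing the update out by hand as above simply makes the extra sign-penalty term of Eq. (3) explicit.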
3.4. On the Gradient Dynamics

As finding the optimal mask is essentially a combinatorial optimization problem over the set of 2^n possible statuses of on-off switches (n is the number of switches), the proposed gradient-based search rule given in (3) provides an alternative approach to look through the discrete candidate sets in a continuous space. We elaborate below that the STE-based gradient in (2) is related to the Taylor approximation for the Leave-One-Out error and that the penalty term drifts the auxiliary variables to find a model of the target size.

For illustration purposes, we start with the dynamics of the scalar-valued parameter α_{k,j}, the j-th element of α_k, that controls the mask for the j-th column vector V_{k,j} of the embedding table V_k. Define the function ℓ(c) := ℓ[c·V_{k,j}] as the value of the loss function when V_{k,j} is replaced by c·V_{k,j} (0 ≤ c ≤ 1). By the first-order Taylor approximation, we have ℓ(1) − ℓ(0) ≈ ℓ'(0) and ℓ(0) − ℓ(1) ≈ −ℓ'(1). Moreover, it is worth noting that the STE-based gradient in (2) with regard to α_{k,j} is exactly ℓ'(1_{α_{k,j}>0}), i.e.,

∂L_batch / ∂1_{α_{k,j}>0} = ℓ'(1_{α_{k,j}>0}) ≈ ℓ(1) − ℓ(0) = ℓ[V_{k,j}] − ℓ[0·V_{k,j}].   (4)

In other words, the proposed gradient calculates the importance (with batches of data) of the embedding component V_{k,j} using the first-order Taylor approximation. The above heuristics are also mentioned in [36, 37, 25].

We now consider the dynamics of all the auxiliary parameters α_{k,j} provided by (3). As illustrated above, the term ∇_{1_{α_t>0}} L_data measures the importance of each embedding component by approximating the difference between the loss value with un-masked embeddings and the loss value with a mask on. Furthermore, the sign of the penalty term in (3) is determined by the comparison between the number of un-masked components and the target size s. As an analogue, the dynamics of the auxiliary parameters can be viewed as a particle system in a neighborhood of 0 on the real line, in which the particles are initialized at the same point ε > 0 (i.e., the model starts with all embeddings un-masked) and the velocity of each particle is roughly η·(ℓ[0·V_{k,j}] − ℓ[V_{k,j}]) by the approximation in (4). In addition, an external force is introduced by the penalty term λ|‖1_{α>0}‖_1 − s|, which can be interpreted as a wind flow whose velocity is −ηλ < 0 when the number of positive particles is larger than s and ηλ > 0 otherwise. As a result, when the number of positive particles exceeds s, those particles with smaller ℓ[0·V_{k,j}] − ℓ[V_{k,j}] tend to become negative, and vice versa; see Figure 2 for further illustration of the proposed algorithm.

Figure 2: An intuitive viewpoint of the gradient dynamics. Each particle α_{k,j} drifts with velocity ≈ η·(ℓ[0·V_{k,j}] − ℓ[V_{k,j}]), while the external force pushes all particles downward when the number of positive α's exceeds s and upward otherwise, moving them between the un-masked (α > 0) and masked (α ≤ 0) states.
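The particle-system picture can be reproduced with a small synthetic simulation: fix a made-up importance score per column as a stand-in for ℓ[0·V_{k,j}] − ℓ[V_{k,j}], apply the drift plus the external force from Eq. (3), and count how many α's stay positive. All numbers below are invented purely to illustrate the dynamics under these assumptions; they are not results from the paper.

```python
import torch

torch.manual_seed(0)
n, s = 20, 8                        # 20 embedding columns, target size s = 8
eta, lam, eps = 1e-2, 2.0, 0.01     # lam chosen large so the external force dominates
importance = torch.rand(n) * 2 - 1  # synthetic stand-in for l[0*V_kj] - l[V_kj], in [-1, 1]

alpha = torch.full((n,), eps)       # all columns start un-masked
for t in range(3000):
    active = int((alpha > 0).sum())
    drift = eta * importance                            # velocity from the loss term
    force = -eta * lam * ((active > s) - (active < s))  # external force ("wind")
    alpha = alpha + drift + force

kept = set(torch.nonzero(alpha > 0).flatten().tolist())
top_s = set(importance.topk(s).indices.tolist())
# roughly s columns tend to survive, largely the ones with the highest importance
print(len(kept), len(kept & top_s))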
4. Experiments

To validate the performance of our proposed framework, we conduct extensive experiments on three real-world recommendation datasets. Through the experiments, we seek to answer the following three questions: (i) How does our proposed method, HAM, perform compared with other auxiliary mask pruning (AMP) methods? (ii) How does our proposed framework perform compared with the existing field-wise embedding size search methods in the literature in terms of prediction ability and memory consumption? (iii) How does the soft orthogonal regularity boost the performance of our proposed framework? In the following, we first introduce the experimental setups, including datasets, baselines, and implementation details, and then present results as well as discussions.

4.1. Datasets and Data Preparation

We use three benchmark datasets in our experiments: (i) MovieLens-1M: this dataset contains users' ratings (1-5) on movies. We treat samples with ratings greater than 3 as positive samples and samples with ratings below 3 as negative samples; neutral samples with a rating equal to 3 are dropped. (ii) Criteo: this dataset has 45 million users' clicking records on displayed ads. It contains 26 categorical feature fields and 13 numerical feature fields. (iii) Avazu: this dataset has 40 million clicking records on displayed mobile ads. It has 22 feature fields spanning from user/device features to ad attributes.

We follow the common approach to remove infrequent features and discretize numerical values. First, we remove the infrequent feature values and treat them as a single value "unknown", where the threshold is set to {10, 4} for Criteo and Avazu, respectively. Second, we convert all numerical values into categorical values by transforming a value z to int(log(z)^2) if int(z) > 2 and to int(z) − 2 otherwise, which is proposed by the winner of the Criteo Competition (https://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf). Third, we randomly split all training samples into 80% for training, 10% for validation, and 10% for testing.

4.2. Baselines

We compare our proposed pruning method HAM with several baseline methods below.

• Uniform: this method assigns a uniform embedding size to all feature fields.

• Auxiliary Mask Pruning: we compare our proposed method with other common approaches that apply an auxiliary mask to the embeddings. To be specific, we select three representative types of auxiliary mask listed in Table 1 and apply the same one-shot NAS framework as described in Algorithm 1 (see the sketch after this list): (i) SAM: the embeddings are masked directly by auxiliary weights 0 ≤ α ≤ 1, which is equivalent to DNIS [1]; (ii) SAM-GS: the auxiliary mask is obtained by the Gumbel-sigmoid function of auxiliary parameters α ∈ (0, 1) and u ~ Uniform(0, 1) below: Sigmoid[(log(α/(1 − α)) + log(u/(1 − u)))/τ]; (iii) HAM-p: the mask is generated by Bernoulli random variables with parameters p when forwarding the network, and the gradient for p is calculated by STE. In the retraining stage, we keep the embedding components with top-s auxiliary weights for fair comparisons.

• AutoDim [2]: the state-of-the-art NAS-based method in the literature, which outperforms a list of search methods [8, 10, 5, 7]; see Section 3.4 in [2].
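The sketch referred to in the Auxiliary Mask Pruning item above: the four mask constructions compared in this section can each be written in a few lines. This is our paraphrase of the formulas in Table 1 and the bullet list, not code from the paper, and the straight-through expressions are one standard way to realize the STE.

```python
import torch


def sam_mask(alpha):
    """SAM: soft auxiliary mask; embeddings are scaled directly by weights in [0, 1]
    (clamping is one simple way to keep alpha inside the interval)."""
    return alpha.clamp(0.0, 1.0)


def sam_gs_mask(alpha, tau=0.1):
    """SAM-GS: Gumbel-sigmoid mask, Sigmoid[(log(a/(1-a)) + log(u/(1-u))) / tau]."""
    u = torch.rand_like(alpha)
    return torch.sigmoid((torch.log(alpha / (1 - alpha))
                          + torch.log(u / (1 - u))) / tau)


def ham_p_mask(p):
    """HAM-p: Bernoulli sample in the forward pass, straight-through gradient to p."""
    sample = torch.bernoulli(p)
    return sample + p - p.detach()


def ham_mask(alpha):
    """HAM (ours): deterministic indicator 1_{alpha > 0} with straight-through gradient."""
    return (alpha > 0).float() + alpha - alpha.detach()
```

In each case the returned mask multiplies the embedding vector element-wise, as in the earlier HAMEmbedding sketch; only the hard deterministic mask guarantees that the model evaluated during search equals the model selected for retraining.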
4.3. Implementation Details

Base Model Architecture. We adopt four representative CTR-prediction models from the literature, FM [26], DeepFM [27], AutoInt [28], and DCN-V2 [31], as our base models to compare their performance. For FM, DeepFM, and AutoInt, we add an extra embedding vector layer between the feature interaction layer and the embedding layer, as proposed in [10]: linear transformations (without bias) are applied to the mixed-dimension embeddings so that the transformed embedding vectors are of the same size, which is set to 16 in our experiments. This is not required for DCN-V2 due to its bit-wise crossing mechanism. In the pretraining stage, the base dimension d_k for each feature field k is chosen as min(16, C_k), where C_k is the field dimension.

Optimization. We follow the common setup in the literature and employ the Adam optimizer with a learning rate of 10^-3 to optimize the model parameters in all three stages, and use the SGD optimizer to optimize the auxiliary weights in the search stage. To fairly compare different AMP methods, the number of search epochs is fixed at 10, the learning rates of the auxiliary parameters are chosen from the search grid {1, 10^-1, 10^-2, 10^-3}, and the temperature of the sigmoid function is chosen from the search grid {1, 10^-1, 10^-2}. To obtain the best results of each method, we pick 10^-2 as the SGD learning rate for SAM, SAM-GS, and HAM-p, and 10^-3 for our method HAM. In HAM, the initial value of the auxiliary weights is ε = 0.01 and the hyperparameter λ = 5 × 10^-5. For the orthogonal regularity, we employ νR(V) with ν = 10^-3 for training models on MovieLens-1M, and use the pair-wise cosine similarities νR(V̄) with ν = 10^-6 on Avazu. We use PyTorch to implement our method and train it with mini-batch size 2048 on a single 16GB-memory Nvidia Tesla V100.

4.4. Performance Comparison

We next present detailed comparisons with other methods. We adopt AUC (Area Under the ROC Curve) on the test datasets to measure the performance of models and the number of embedding parameters to measure memory usage.

Comparison with other AMP Methods. We first compare our proposed AMP method to the other common AMP methods listed in Table 1 on MovieLens-1M. To fairly compare the performance, we not only consider the test AUC under the same target total embedding size s but also take the number of model parameters into account. In Figure 3, we report the average values of test AUC under the same total embedding size and the best results among 10 independent runs in terms of AUC, and plot the Test AUC - # Parameter curve on MovieLens-1M. For each method, three measurements are provided from left to right with the target total embedding size s = 14, 28, 42 respectively. We drop the Test AUC - # Parameter curve of the method with the worst performance at the bottom.

Figure 3: Performance comparison of different auxiliary mask pruning methods (Uniform, SAM, SAM-GS, HAM-p, HAM) on the four base models FM, DeepFM, AutoInt, and DCN-V2 on MovieLens-1M. For each method, three measurements from left to right are reported with the target total embedding size s = 14, 28, 42 respectively.

We observe that: (a) fixing the target total embedding size, our method HAM outperforms all other methods on the four base models with regard to the value of test AUC. As illustrated in the Test AUC - # Parameter curves, with the same budget of total embedding size, HAM tends to assign larger sizes to informative feature fields with larger vocabulary sizes compared with other methods. This should be attributed to the gradient dynamics described in Section 3.4, which help to identify the important embedding components; as a result, HAM obtains models with higher AUC under the same total embedding size. (b) It is interesting to observe that SAM also assigns larger embedding sizes to relatively un-informative features compared to other methods with the same total embedding size, which verifies the findings in [12] that the magnitudes of auxiliary weights may not indicate importance. (c) HAM exhibits its superiority not only on the vector-wise feature crossing models (FM, DeepFM, AutoInt), but also on the bit-wise feature crossing model (DCN-V2), where Uniform serves as a strong baseline. The performance of HAM should be credited to the hard selection induced by the indicator functions: with the deterministic binary mask, the masked model evaluated in the search stage is equivalent to the selected model in the retraining stage, as illustrated in Table 1. As a result, we conclude that HAM exhibits stable performance on all four base models, while SAM, SAM-GS, and HAM-p suffer from instability and suboptimality in some circumstances.

Comparison with AutoDim. We also compare our method with the state-of-the-art NAS-based method AutoDim, with {2, 4, 8} as the candidate embedding sizes, on the Criteo and Avazu datasets in Figure 4. In our method, the total embedding size s is set to 90 and 50 for Criteo and Avazu, respectively. Our method outperforms AutoDim by finding models with comparably higher AUC and smaller sizes.

Figure 4: Performance comparison (Test AUC vs. number of parameters) between HAM and AutoDim on Criteo and Avazu; markers distinguish the base models FM, DeepFM, AutoInt, and DCN-V2.

In addition, from a computational perspective, our approach HAM has several advantages over AutoDim: (a) HAM has a larger search space due to the structural masking mechanism. For each feature, its candidate embedding size ranges from 0 to its base dimension d_k; on the contrary, AutoDim requires manually pre-specifying a set of candidate embedding sizes for each feature. (b) HAM is more computationally efficient. We emphasize that AutoDim introduces extra space and time complexity through the operations of lifting, batch normalization, and aggregation for each candidate size, while HAM only requires an extra element-wise product between the binary mask and the embedding vectors. Moreover, HAM can output multiple models of different total embedding sizes given the same pre-trained model, whereas AutoDim requires pretraining the model once more if the candidate embedding sizes are changed.

Figure 5: Cosine similarities between column vectors from the embedding table of genre when pretraining DCN-V2 on MovieLens-1M with and without SO (panels "Pretrain with SO" and "Pretrain without SO").

Table 2: The test AUC gain by Algorithm 1 with SO under different target embedding sizes on MovieLens-1M and Avazu.
Data | Size s | FM | DeepFM | AutoInt | DCN-V2
ML-1M | 14 | +.0006 | +.0029 | +.0013 | +.0012
ML-1M | 28 | +.0018 | +.0034 | +.0018 | +.0027
ML-1M | 42 | +.0036 | +.0033 | +.0009 | +.0037
Avazu | 44 | +.0034 | +.0039 | +.0026 | +.0031
Avazu | 66 | +.0039 | +.0044 | +.0020 | +.0033
The results are obtained by averaging 10 runs; bold font in the original table indicates statistical significance under two-sided t-tests (p < 0.05).

On the Orthogonal Regularity.
We further analyze 5. Conclusions the effect of the orthogonal regularity proposed in our framework. In Figure 5, we visualize the cosine simi- In this paper, we propose a general auxiliary masking larities between column vectors from the embedding of pruning framework to search the embedding sizes for dif- genre when pretraining DCN-V2 on MovieLens-1M with ferent feature fields adaptively in todayβs recommender and without SO where the embedding size is set as 8. It systems. The proposed method leverages the indicator is clear to see that SO helps to reduce the correlations function to search candidate models exactly, which is within those embedding column vectors. Moreover, we efficient and can be easily applied to various models. compare the performance of models selected by our pro- Extensive experiments demonstrate it can effectively re- posed Algorithm 1 with the one without SO in Table 2. move redundant embedding dimensions without great Statistical significant gains in test AUC can be observed performance loss. for MovieLens-1M and Avazu on all base models while the training time per epoch with SO only increases by about 1.6 times compared to the time without SO. Acknowledgments The authors are grateful to anonymous reviewers for 4.5. Discussions and Future Work their constructive comments that greatly improved the presentation of this paper. T. Xiao thanks his Ph.D. advi- Multi-stage Search. As the initial base dimension sor Dr. Krishnakumar Balasubramanian for supports and for each feature fields it tunable, we observe a gain, helpful discussions during the internship at ByteDance. βΌ0.3%, in test AUC from preliminary experiments on MovieLens-1M by searching via HAM with a smaller ini- tial size = 8. This observation confirms the claim made References in recent works [38] that enlarging search space is un- beneficial or even detrimental to existing NAS methods. [1] W. Cheng, Y. Shen, L. Huang, Differentiable neu- To address this issue, our method can also be adapted to ral input search for recommender systems, arXiv design a multi-stage search strategy with a sufficiently preprint arXiv:2006.04466 (2020). larger initial size and stage-wisely prune embeddings [2] X. Zhao, H. Liu, H. Liu, J. Tang, W. Guo, J. Shi, with decreasing target embedding size π . S. Wang, H. Gao, B. Long, Memory-efficient em- Comparison with PEP. To justify the advantages bedding for recommendations, arXiv preprint of our proposed framework, we also discuss the most arXiv:2006.14827 (2020). recent Plug-in Embedding Pruning (PEP) [14] method. [3] S. Zhang, L. Yao, A. Sun, Y. Tay, Deep learning As introduced before, PEP is an unstructured pruning based recommender system: A survey and new method that retains a large number of zero parameters perspectives, ACM Computing Surveys (CSUR) 52 and requires the technique of sparse matrix storage. On (2019) 1β38. the contrary, our method can not only prune the em- [4] P. Covington, J. Adams, E. Sargin, Deep neural net- bedding columns explicitly but also reduce the parame- works for youtube recommendations, in: Proceed- ters in the dense layers while PEP cannot. For example, ings of the 10th ACM conference on recommender HAM (π = 28) can obtain a lighter DCN-V2 model on systems, 2016, pp. 191β198. MovienLens-1M with only βΌ8% number of parameters [5] X. Zhao, C. Wang, M. Chen, X. Zheng, X. Liu, (3,281/40,241) in the dense layers comparing to the model J. Tang, Autoemb: Automated embedding dimen- with a uniform embedding size 16. 
sionality search in streaming recommendations, Comparison with SSEDS. During the revisions of arXiv preprint arXiv:2002.11252 (2020). this paper, we noticed that a concurrent work [39] also [6] H. Liu, X. Zhao, C. Wang, X. Liu, J. Tang, Auto- proposes to prune embeddings column-wisely as well mated embedding size search in deep recommender as row-wisely using indicator functions. Our work is systems, in: Proceedings of the 43rd International different from theirs in two aspects: (i) we focus on field- ACM SIGIR Conference on Research and Develop- wise search by pruning nearly orthogonal embedding col- ment in Information Retrieval, 2020, pp. 2307β2316. umn vectors; (ii) we use a gradient-descent-based method [7] M. R. Joglekar, C. Li, M. Chen, T. Xu, X. Wang, in the search stage to solve the bilevel problem on the J. K. Adams, P. Khaitan, J. Liu, Q. V. Le, Neural in- training and validation set; the gradient dynamics enable put search for large scale recommendation models, us to re-activate (un-mask) masked embeddings while in: Proceedings of the 26th ACM SIGKDD Interna- SSEDS directly mask all components with small gradient tional Conference on Knowledge Discovery & Data magnitudes. Comparing our method with SSEDS and Mining, 2020, pp. 2387β2397. incorporating our method into SSEDS is our future work. [8] T. Chen, L. Li, Y. Sun, Differentiable product quan- tization for end-to-end embedding compression, in: International Conference on Machine Learning, or propagating gradients through stochastic neu- PMLR, 2020, pp. 1617β1626. rons for conditional computation, arXiv preprint [9] W.-C. Kang, D. Z. Cheng, T. Chen, X. Yi, D. Lin, arXiv:1308.3432 (2013). L. Hong, E. H. Chi, Learning multi-granular quan- [23] E. Jang, S. Gu, B. Poole, Categorical reparameteri- tized embeddings for large-vocab categorical fea- zation with gumbel-softmax, in: International Con- tures in recommender systems, in: Companion ference on Learning Representations, 2016. Proceedings of the Web Conference 2020, 2020, pp. [24] S. Srinivas, A. Subramanya, R. Venkatesh Babu, 562β566. Training sparse neural networks, in: Proceedings [10] A. A. Ginart, M. Naumov, D. Mudigere, J. Yang, of the IEEE conference on computer vision and pat- J. Zou, Mixed dimension embeddings with applica- tern recognition workshops, 2017, pp. 138β145. tion to memory-efficient recommendation systems, [25] M. Ye, D. Choudhary, J. Yu, E. Wen, Z. Chen, J. Yang, in: 2021 IEEE International Symposium on Infor- J. Park, Q. Liu, A. Kejariwal, Adaptive dense-to- mation Theory (ISIT), IEEE, 2021, pp. 2786β2791. sparse paradigm for pruning online recommen- [11] H. Liu, K. Simonyan, Y. Yang, Darts: Differentiable dation system with non-stationary data, arXiv architecture search, in: International Conference preprint arXiv:2010.08655 (2020). on Learning Representations, 2018. [26] S. Rendle, Factorization machines, in: 2010 IEEE In- [12] R. Wang, M. Cheng, X. Chen, X. Tang, C.-J. Hsieh, ternational conference on data mining, IEEE, 2010, Rethinking architecture selection in differentiable pp. 995β1000. nas, arXiv preprint arXiv:2108.04392 (2021). [27] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, Deepfm: a [13] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, factorization-machine based neural network for ctr Q. Le, Understanding and simplifying one-shot prediction, in: Proceedings of the 26th International architecture search, in: International Conference Joint Conference on Artificial Intelligence, 2017, pp. on Machine Learning, PMLR, 2018, pp. 550β559. 1725β1731. 
[14] S. Liu, C. Gao, Y. Chen, D. Jin, Y. Li, Learnable em- [28] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, bedding sizes for recommender systems, in: Inter- J. Tang, Autoint: Automatic feature interaction national Conference on Learning Representations, learning via self-attentive neural networks, in: Pro- 2020. ceedings of the 28th ACM International Conference [15] D. Blalock, J. J. G. Ortiz, J. Frankle, J. Guttag, What is on Information and Knowledge Management, 2019, the state of neural network pruning?, arXiv preprint pp. 1161β1170. arXiv:2003.03033 (2020). [29] R. Wang, B. Fu, G. Fu, M. Wang, Deep & cross [16] P. Savarese, H. Silva, M. Maire, Winning the lottery network for ad click predictions, in: Proceedings with continuous sparsification, in: Advances in of the ADKDDβ17, 2017, pp. 1β7. Neural Information Processing Systems, volume 33, [30] X. He, T.-S. Chua, Neural factorization machines for Curran Associates, Inc., 2020, pp. 11380β11390. sparse predictive analytics, in: Proceedings of the [17] C. Louizos, M. Welling, D. P. Kingma, Learning 40th International ACM SIGIR conference on Re- sparse neural networks through l_0 regularization, search and Development in Information Retrieval, in: International Conference on Learning Represen- 2017, pp. 355β364. tations, 2018. [31] R. Wang, R. Shivanna, D. Cheng, S. Jain, D. Lin, [18] S. Xie, H. Zheng, C. Liu, L. Lin, Snas: stochastic L. Hong, E. Chi, Dcn v2: Improved deep & cross neural architecture search, in: International Con- network and practical lessons for web-scale learn- ference on Learning Representations, 2018. ing to rank systems, in: Proceedings of the Web [19] Y. Guo, A. Yao, Y. Chen, Dynamic network surgery Conference 2021, 2021, pp. 1785β1797. for efficient dnns, in: Proceedings of the 30th In- [32] N. Bansal, X. Chen, Z. Wang, Can we gain more ternational Conference on Neural Information Pro- from orthogonality regularizations in training deep cessing Systems, 2016, pp. 1387β1395. networks?, Advances in Neural Information Pro- [20] X. Xiao, Z. Wang, Autoprune: Automatic network cessing Systems 31 (2018) 4261β4271. pruning by regularizing auxiliary parameters, Ad- [33] P. RodrΓguez, J. Gonzalez, G. Cucurull, J. M. Gonfaus, vances in Neural Information Processing Systems X. Roca, Regularizing cnns with locally constrained 32 (NeurIPS 2019) 32 (2019). decorrelations, arXiv preprint arXiv:1611.01967 [21] X. Zhou, W. Zhang, H. Xu, T. Zhang, Effective spar- (2016). sification of neural networks with global sparsity [34] S. Yan, Y. Zheng, W. Ao, X. Zeng, M. Zhang, Does constraint, in: Proceedings of the IEEE/CVF Confer- unsupervised architecture representation learning ence on Computer Vision and Pattern Recognition, help neural architecture search?, Advances in Neu- 2021, pp. 3599β3608. ral Information Processing Systems 33 (2020) 12486β [22] Y. Bengio, N. LΓ©onard, A. Courville, Estimating 12498. [35] H. Hazimeh, Z. Zhao, A. Chowdhery, M. Sathi- amoorthy, Y. Chen, R. Mazumder, L. Hong, E. H. Chi, Dselect-k: Differentiable selection in the mixture of experts with applications to multi-task learning, arXiv preprint arXiv:2106.03760 (2021). [36] P. Molchanov, S. Tyree, T. Karras, T. Aila, J. Kautz, Pruning convolutional neural networks for resource efficient inference, arXiv preprint arXiv:1611.06440 (2016). [37] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, J. Kautz, Importance estimation for neural network pruning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 
11264β11272. [38] Y. Ci, C. Lin, M. Sun, B. Chen, H. Zhang, W. Ouyang, Evolving search space for neural architecture search, in: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, 2021, pp. 6659β6669. [39] L. Qu, Y. Ye, N. Tang, L. Zhang, Y. Shi, H. Yin, Single- shot embedding dimension search in recommender system, arXiv preprint arXiv:2204.03281 (2022).