=Paper=
{{Paper
|id=Vol-3317/paper2
|storemode=property
|title=Field-wise Embedding Size Search via Structural Hard Auxiliary Mask Pruning for Click-Through Rate Prediction
|pdfUrl=https://ceur-ws.org/Vol-3317/Paper2.pdf
|volume=Vol-3317
|authors=Tesi Xiao,Xia Xiao,Ming Chen,Youlong Cheng
|dblpUrl=https://dblp.org/rec/conf/cikm/XiaoXCC22
}}
==Field-wise Embedding Size Search via Structural Hard Auxiliary Mask Pruning for Click-Through Rate Prediction==
Field-wise Embedding Size Search via Structural Hard Auxiliary Mask Pruning for Click-Through Rate Prediction Tesi Xiao1 , Xia Xiao2 , Ming Chen2 and Youlong Cheng2,* 1 University of California, Davis, USA 2 ByteDance, Mountain View, USA Abstract Feature embeddings are one of the most important steps when training deep learning based Click-Through Rate prediction models, which map high-dimensional sparse features to dense embedding vectors. Classic human-crafted embedding size selection methods are shown to be βsub-optimal" in terms of the trade-off between memory usage and model capacity. The trending methods in Neural Architecture Search (NAS) have demonstrated their efficiency to search for embedding sizes. However, most existing NAS-based approaches suffer from expensive computational costs, the curse of dimensionality of the search space, and the discrepancy between continuous search space and discrete candidate space. Other methods that prune embeddings in an unstructured manner fail to explicitly reduce the computational costs. In this paper, to address those limitations, we propose a novel strategy that searches for the optimal mixed-dimension embedding scheme by structurally pruning a super-net via Hard Auxiliary Mask. Our method aims to directly search candidate models in the discrete space using a simple and efficient gradient-based method. Furthermore, we introduce orthogonal regularity on embedding tables to reduce correlations within embedding columns and enhance representation capacity. Extensive experiments demonstrate it can effectively remove redundant embedding dimensions without great performance loss. Keywords CTR Prediction, Embedding Size, Network Pruning, Neural Architecture Search 1. Introduction embedding dimension for all features, either due to the prerequisites of the model input or simply for the sake Deep learning based recommender systems (DLRS) have of convenience. If the embedding dimensions are uni- demonstrated superior performance over more tradi- formly high, it leads to increased memory usage and tional recommendation techniques [3]. The success of computational cost, as it fails to handle the heterogeneity DLRS is mainly attributed to their ability to learn mean- among different features. As a concrete example, encod- ingful representations with categorical features, that sub- ing features with few unique values like gender with large sequently help with modeling the non-linear user-item embedding vectors leads to over-parametrization. Con- relationships efficiently. Indeed, real-world recommenda- versely, the selected embedding size may be insufficient tion tasks usually involve a large number of categorical for highly-predictive features with large cardinalities, feature fields with high cardinality (i.e. the number of such as the userβs last search query. Therefore, finding unique values or vocabulary size) [4]. One-Hot Encoding appropriate embedding dimensions for different feature is a standard way to represent such categorical features. fields is essential. To reduce the memory cost of One-Hot Encoding, DLRS The existing works towards automating embedding first maps the high-dimensional one-hot vectors into size search can be categorized into two groups: (i) field- real-valued dense vectors via the embedding layer. Such wise search [5, 2, 6]; (ii) vocabulary-wise search [7, 8, 9, 10]. embedded vectors are subsequently used in predictive The former group aims to assign different embedding models for obtaining the required recommendations. 
sizes to different feature fields, while embeddings in the In this pipeline, the choice of the dimension of the same feature field share the same dimension. The lat- embedding vectors, also known as embedding dimension, ter further attempts to find different embedding sizes plays a crucial role in the overall performance of the to different feature values within the same feature field, DLRS. Most existing models assign a fixed and uniform which is generally based on the frequencies of feature DL4SRβ22: Workshop on Deep Learning for Search and Recommen- values. Although it has been shown that the latter group dation, co-located with the 31st ACM International Conference on can significantly reduce the model size without a great Information and Knowledge Management (CIKM), October 17-21, 2022, performance drop, the second group of works suffers Atlanta, USA from several challenges and drawbacks (we refer readers * Corresponding author. to Section 3.4 in [2] for details): (a) a large number of $ texiao@ucdavis.edu (T. Xiao); x.xiaxiao@bytedance.com (X. Xiao); ming.chen@bytedance.com (M. Chen); unique values in each feature field leads to a huge search youlong.chen@bytedance.com (Y. Cheng) space in which the optimal solution is difficult to find; Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). (b) the feature values frequencies are time-varying and CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) π¦! Our Framework DNIS [Cheng et al., 2020] Interaction Layers Magnitude-based Pruning Soft Auxiliary Mask 0.6 0.1 0.9 0.8 β -0.36 0.03 -0.18 0.96 1 0 1 1 0 1 1 0 1 1 Hard Auxiliary Mask Embedding Vector -0.6 0.3 -0.2 0.12 Output β β β Embedding 1.2 1 -0.9 -0.6 0.3 -0.2 0.12 0.1 0.3 -2.1 Vector Embedding Vector AutoDim [Zhao et al., 2020] Embedding Look-up 0.4 -0.2 -0.3 0.2 -1 0.1 -0.6 -1 0.2 -0.1 Linear Transform / Zero Padding -0.2 0.1 -0.8 -0.5 0.7 -0.3 0.2 0.1 0.3 -2.1 Gumbel-softmax Embedding Table V! V" V# Output 1.2 1 -0.9 -0.6 0.3 -0.2 0.12 1.2 -0.1 0.1 + Embedding 0.3 0.2 0.7 0.1 0.2 0.5 -0.4 0.7 0.6 -0.4 Batch Norm + Weighted Sum Vector Embedding Lookup with Memory Sharing Field 1 Field π Field πΎ Figure 1: (left) The framework of our method. The operator β denotes the element-wise product. The gray embedding components are masked and thus can be removed directly. (right) The basic ideas of DNIS [1] and AutoDim [2]. not pre-known in real-time recommender system; (c) it is into the same dimension with batch normalization so difficult to handle embeddings with different dimensions that they can be aggregated with architecture weights for the same feature field in the state of art DLRS due to of candidate sizes and fed into the subsequent layers to the feature crossing mechanism. As a result, we fix our obtain the prediction. Cheng et al. [1] also leverage the attention to the field-wise embedding size search in this idea of DARTS but with the architecture weights being work. the weights of sub-blocks of embedding matrices. The majority of recent works towards automating em- Despite the obtained promising results, the existing bedding size search are based on the trending ideas in methods have certain limitations which arise from the Neural Architecture Search (NAS). NAS has been an ac- discrepancy between the large discrete candidate space tive research area recently, which is composed of three and its continuous relaxation. 
On the one hand, the elements: (i) a search space π of all candidate models; DARTS-based methods in [5, 2, 1] are algorithmically (ii) a search strategy that goes through models in π; (iii) efficient but search in an augmented continuous space; a performance estimation strategy that evaluates the per- it has been shown that the discrepancy may lead to formance of the selected model. To leverage the NAS finding unsatisfactory candidates because the magnitude techniques to search embedding sizes for each feature of architecture weights does not necessarily indicate field, Joglekar et al. [7] formulate the search space by dis- how much the operation contributes to the overall cretizing an embedding matrix into several sub-matrices performance [12]. On the other hand, while the RL-based and take a Reinforcement Learning (RL) approach with methods in [7, 6] and the hard selection variant in [5] the Trainer-Controller framework for search and per- directly search and evaluate models in the original formance estimation, in which the controller samples discrete candidate space, they are computationally heavy different embedding blocks to maximize the reward ob- and thus not favored by the large-scale DLRS. Therefore, served from the trainer that trains the model with the a natural question follows: controllerβs choices. Liu et al. [6] also adopt the RL-based search which starts with small embedding sizes and takes Is it possible to come up with an efficient method that actions to keep or enlarge the size based on the predic- searches straight through the discrete candidate space of tion error. Their proposed method requires sequentially all possible embedding sizes? training DRLS and the policy network repeatedly until convergence, which is computationally expensive. In this work, we provide a positive answer by propos- Given the rise of one-shot NAS aiming to search ing an end-to-end field-aware embedding size search and evaluate candidate models together in an over- method leveraging the Straight Through Estimator (STE) parametrized supernet, Zhao et al. [5, 2] follow the idea of gradients for non-differentiable functions. Motivated of Differential Neural Architecture Search (DARTS)[11] by one-shot NAS and network pruning, our method seeks and solve a continuous bilevel optimization by the to adaptively search for a good sub-network in a pre- gradient-based algorithm. To be specific, they define trained supernet by masking redundant dimensions. The a search space that includes all possible embedding sizes mask, named as Hard Auxiliary Mask (HAM), is real- of each feature field; for instance in [2], 5 candidate ized by the indicator function of auxiliary weights. Fur- sizes {2, 8, 16, 24, 32} are selected for each feature field. thermore, to reduce the information redundancy and These embedding vectors in different sizes are then lifted boost the search performance, we impose the orthogonal regularity on embedding tables and train models with puts, existing works attempt to cast it into a differentiable regularized loss functions. Contributions and nov- problem and provide gradient-based search algorithms. 
A elty of our work are: (i) we propose to prune embed- straightforward method is to directly replace binary val- ding tables column-wisely via hard auxiliary masks for ues by smooth functions of auxiliary parameters during the CTR prediction models, which can effectively com- the forward pass, such as sigmoid [16], Gumbel-sigmoid press the model; (ii) we offer a gradient-based pruning [17], softmax [11], Gumbel-softmax [18], and piece-wise method and delve into its dynamics with mathematical linear [19]. However, all these methods, which we name insights; (iii) we introduce orthogonal regularity for train- Soft Auxiliary Mask (SAM), suffer from the discrepancy ing recommendation models and experimentally show caused by continuous relaxation. Other methods, which that orthogonal regularity can help to achieve significant we refer to Hard Auxiliary Mask (HAM), preserve binary improvement; (iv) we obtain state-of-the-art results on values in the forward pass with indicator functions [20] various modern models and our method is scalable on or Bernoulli random variables [21] and optimize parame- large datasets. ters using the straight-through estimator (STE) [22, 23]. A detailed comparison of four representative auxiliary masking methods for approximating binary values is pro- 2. Related Work vided in Table 1. Embedding Size Search for Recommender Systems. To deal with the heterogeneity among different features 3. Methodology and reduce the memory cost of sizable embedding tables, several recent works introduce a new paradigm of mixed- 3.1. Preliminaries dimension embedding tables. In particular, Ginart et al. [10] propose to substitute the uniform-dimension embed- Here we briefly introduce the mechanism of DLRS and dings V β RπΆΓπ with V = EP, where E β RπΆΓπ is define the terms used throughout the paper. the smaller embeddings with π βͺ π and P β Rπ Γπ is the Model Architecture. Consider the data input projection that lift the embeddings into the base dimen- involves πΎ features fields from users, items, their sion π for feature crossing purposes. A popularity-based interactions, and contextual information. We de- rule is further designed to find the dimension π for each note these raw features by multiple one-hot vectors feature field which is too rough and cannot handle impor- x1 β RπΆ1 , . . . , xπΎ β RπΆπΎ , where field dimensions tant features with low frequency. In addition, plenty of πΆ1 , . . . πΆπΎ are the cardinalities of feature fields1 . Ar- works seek to provide an end-to-end framework leverag- chitectures of DLRS often consist of three key com- ing the advances in NAS, including RL approaches [7, 6] ponents: (i) embedding layers with tables V1 β and differentiable NAS methods [5, 13, 2, 1]. The most RπΆ1 Γπ1 , . . . , VπΎ β RπΆπΎ ΓππΎ that map sparse one-hot relevant work to us is [1], which employs a soft auxiliary vectors to dense vectors in a low dimensional embedding mask to prune the embeddings structurally. However, it space by vπ = Vπβ€ xπ ; (ii) feature interaction layers that requires magnitude-based pruning to derive fine-grained model complex feature crossing using embedding vec- mixed embedding, in which the discrepancy between the tors; (iii) output layers that make final predictions for continuous relaxed search space and discrete candidate specific recommendation tasks. space could lead to a relatively great loss in performance. 
The most recent work [14] proposes a Plug-in Embedding Pruning (PEP) approach to prune embedding tables into sparse matrices to reduce storage costs. Albeit significantly reducing the number of non-zero embedding parameters, this type of unstructured pruning method fails to explicitly reduce the embedding size.

Neural Network Pruning via Auxiliary Masks. To deploy deep learning models on resource-limited devices, there is no lack of work in neural network pruning; see [15] and the references therein. One popular approach towards solving this problem is to learn a pruning mask through auxiliary parameters, which considers pruning as an optimization problem that tends to minimize the supervised loss of the masked neural network with a certain sparsity constraint. As learning the optimal mask is indeed a discrete optimization problem for binary inputs, existing works cast it into a differentiable problem and provide gradient-based search algorithms; a comparison of four representative auxiliary masking methods for approximating binary values is provided in Table 1.

Table 1: A List of Representative Auxiliary Masking Methods
- Soft, Deterministic. Forward: α ∈ [0, 1] [1]; Sigmoid(α), α ∈ R [16]. Backward: autograd. Model_eval = Model_sel: ✗
- Soft, Stochastic. Forward: Sigmoid(log α + log(u/(1 − u))), α ∈ R, u ~ Uniform(0, 1) [17]. Backward: autograd. Model_eval = Model_sel: ✗
- Hard, Stochastic. Forward: Bernoulli(p). Backward: STE [24], Gumbel-STE [21]. Model_eval = Model_sel: ✗
- Hard, Deterministic (our approach). Forward: 1_{α>0}. Backward: STE [20, 25]. Model_eval = Model_sel: ✓
Remark: Model_eval stands for the evaluated masked model; Model_sel is the final model selected by the algorithm.

The feature crossing techniques of existing models fall into two types: vector-wise and bit-wise. Models with vector-wise crossing explicitly introduce interactions by the inner product, such as Factorization Machine (FM) [26], DeepFM [27], and AutoInt [28]. The bit-wise crossing, in contrast, implicitly adds interaction terms by element-wise operations, such as the outer product in Deep & Cross Network (DCN) [29] and the Hadamard product in NFM [30] and DCN-V2 [31]. In this work, we deploy our framework to the Click-Through Rate (CTR) prediction problem with four base models: FM, DeepFM, AutoInt, and DCN-V2. (Footnote 1: Numeric features are converted into categorical data by binning.)

Notations. Throughout this paper, we use ⊙ for the Hadamard (element-wise) product of two vectors. The indicator function of a scalar-valued α is defined as 1_{α>0} = 1 if α > 0 and 1_{α>0} = 0 if α ≤ 0; the indicator function of a vector α ∈ R^d is defined as 1_{α>0} = [1_{α_1>0}, ..., 1_{α_d>0}]^T. The identity matrix is written as I, and the function diag(·) returns a diagonal matrix with its diagonal entries given by the input. We use ‖·‖_1 and ‖·‖_F for the ℓ1 norm of vectors and the Frobenius norm of matrices, respectively.

3.2. Background

In general, CTR prediction models take the concatenation of all feature fields from a user-item pair, denoted by x = [x_1; x_2; ...; x_K], as the input vector. Given the embedding layer V = {V_1, V_2, ..., V_K}, the feature interaction layers take the embedding vectors v and feed the learned hidden states into the output layer to obtain the prediction. The embedding vectors v are the dense encoding of the one-hot input x, defined as follows:

v = [v_1; v_2; ...; v_K] = [V_1^T x_1; V_2^T x_2; ...; V_K^T x_K] := 𝒱x,

where 𝒱 is the embedding lookup operator. The prediction score ŷ is then calculated with the model's other parameters Θ in the feature interaction layers and output layer by ŷ = f(v|Θ) = f(𝒱x|Θ) = F(x|V, Θ), where ŷ ∈ [0, 1] is the predicted probability, f(·|Θ) is the prediction function of embedding vectors, and F = f ∘ 𝒱 is the prediction function of raw inputs. To learn the model parameters V, Θ, we aim to minimize the Log-loss on the training data, i.e.,

min_{V,Θ} L_train(V, Θ) := −(1/N) Σ_{i=1}^{N} [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ],

where N is the total number of training samples.

Hard Auxiliary Mask. As illustrated in Figure 1, we add auxiliary masks for each embedding dimension slot. Specifically, the auxiliary masks are indicator functions of auxiliary parameters α = [α_1; α_2; ...; α_K], where α_k ∈ R^{d_k} is of the same size as the corresponding embedding vector v_k. Provided with the masks, the predicted probability score ŷ is given as follows:

ŷ = f(v ⊙ 1_{α>0} | Θ) = F(x | Ṽ^α, Θ) = F(x | α, V, Θ),

where Ṽ^α = {Ṽ_1^{α_1}, ..., Ṽ_K^{α_K}} is the pruned embedding layer with Ṽ_k^{α_k} = V_k diag(1_{α_k>0}), k = 1, ..., K. We emphasize that the embedding tables Ṽ_k^{α_k} are pruned column-wise in a structural manner, unlike PEP [14] which prunes entry-wise.

Orthogonal Regularity. Given the embedding table V_k ∈ R^{C_k×d_k} for feature field k, its column vectors, denoted by V_{k,1}, ..., V_{k,d_k} ∈ R^{C_k}, can be regarded as d_k different representations of feature k in the embedding space. The auxiliary masks above aim to mask relatively uninformative column vectors so as to reduce the model size. Nonetheless, the presence of correlation between these vectors may complicate the selection procedure. Specifically, presuming that the most predictive column V_{k,i} has been selected, it would be problematic to greedily select the next column V_{k,j} that brings the largest loss drop when included. For instance, if V_{k,j} is not orthogonal to V_{k,i}, we have the decomposition V_{k,j} = p + p^⊥, where p = cV_{k,i} is parallel to V_{k,i} and p^⊥ ⊥ p; therefore, it would be difficult to determine whether the increments are attributed to the existing direction p or to the new factor p^⊥. To address this issue, we follow [32] and train the embedding parameters V with Soft Orthogonal (SO) regularization:

R(V) = Σ_{k=1}^{K} ‖V_k^T V_k − I‖_F^2 / d_k^2,   (1)

where the divisors d_k^2 are introduced to handle the heterogeneous dimensionality of embedding tables. We also adopt a relaxed SO regularization in which V_k is replaced by the normalized matrix V̄_k with unit column vectors, which corresponds to the pair-wise cosine similarities within embedding columns [33].
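To make the masking and regularization above concrete, the following is a minimal PyTorch-style sketch of a field-wise embedding layer with hard auxiliary masks and the SO penalty of Eq. (1). It is an illustration of the ideas above rather than the authors' released code; the class and method names (HAMEmbedding, so_penalty) are our own, and the hard + α − α.detach() idiom is one standard way of realizing the identity straight-through estimator used later in Eq. (2).

```python
import torch
import torch.nn as nn


class HAMEmbedding(nn.Module):
    """Field-wise embedding tables V_k with hard auxiliary masks 1_{alpha_k > 0}.

    Minimal sketch of the masking and SO regularity described above; not the
    authors' implementation, and all names are illustrative.
    """

    def __init__(self, field_dims, base_dims, eps=0.01):
        super().__init__()
        self.tables = nn.ModuleList(
            nn.Embedding(c, d) for c, d in zip(field_dims, base_dims))
        # one auxiliary weight per embedding column, initialised at eps > 0 (all un-masked)
        self.alpha = nn.ParameterList(
            nn.Parameter(torch.full((d,), eps)) for d in base_dims)

    def forward(self, x):
        # x: LongTensor of shape (batch, K) holding the categorical index of each field
        out = []
        for k, table in enumerate(self.tables):
            v_k = table(x[:, k])                 # v_k = V_k^T x_k
            hard = (self.alpha[k] > 0).float()   # 1_{alpha_k > 0}
            # straight-through: forward uses the hard mask, backward treats it as alpha
            mask = hard + self.alpha[k] - self.alpha[k].detach()
            out.append(v_k * mask)               # v_k masked column-wise
        return torch.cat(out, dim=1)

    def so_penalty(self):
        # Soft Orthogonal regularizer of Eq. (1): sum_k ||V_k^T V_k - I||_F^2 / d_k^2
        reg = 0.0
        for table in self.tables:
            V = table.weight                     # shape (C_k, d_k)
            d = V.shape[1]
            gram = V.t() @ V
            reg = reg + ((gram - torch.eye(d, device=V.device)) ** 2).sum() / d ** 2
        return reg
```

In a full model, the concatenated output would feed the feature interaction layers, and so_penalty() would be added to the training loss with a coefficient such as the ν used in the experiments.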
3.3. Framework

Our proposed framework is motivated by one-shot NAS and consists of three stages: pretrain, search, and retrain.

Pretrain. As shown in [34], pre-training architecture representations improves the downstream architecture search efficiency. Therefore, in the pretraining stage, we train the base model with a large embedding layer V. The base dimension d_k for each feature field is determined by prior knowledge. In addition, the embedding dimension d_k should not exceed the field dimension C_k, to avoid column-rank deficiency. The SO regularization term (1) is added to the mini-batch training loss for the optimizer to learn near-orthogonal embeddings. The learned model parameters, denoted by V_init, Θ_init, are passed to the search stage as initialization.

Search. Provided with the pre-trained model, the goal of the search stage is to find the column-wise sparse embedding layer that preserves model accuracy, which can be formulated as

min_α min_{V,Θ} L_train(Ṽ^α, Θ) + λ | ‖1_{α>0}‖_1 − s |,

where ‖1_{α>0}‖_1 counts the number of non-zero embedding columns and s is the target number of non-zero columns. Note that instead of direct regularization on ‖1_{α>0}‖_1 as [16, 20, 25] do, we include the target number s to reduce instability from batched training and the choice of hyperparameter λ. However, given that the objective function above is non-differentiable when α = 0 and has a zero gradient anywhere else, traditional gradient descent methods are not applicable. To that end, we use the straight-through estimator (STE) [22], which replaces the ill-defined gradient in the chain rule with a fake gradient. Despite various smooth alternatives used in the literature, such as sigmoid [16] and piecewise polynomials [35], we adopt the simplest identity function for backpropagation, i.e.,

∂L/∂α = (∂L/∂1_{α>0}) · (∂1_{α>0}/∂α) ≈ (∂L/∂1_{α>0}) · (∂α/∂α) = ∂L/∂1_{α>0}.   (2)

Then, the search stage starts with the unmasked model, that is, α_0 = ε · 1 for some small ε > 0. The gradient update rule for α at iteration t is given by

α_{t+1} = α_t − η · ∇_{1_{α_t>0}} L_batch − ηλ · sign(‖1_{α_t>0}‖_1 − s) · 1,   (3)

where η is the learning rate. We will illustrate in Section 3.4 that the above updates enjoy certain benefits and that the last term plays an important role by pushing the optimizer to iteratively evaluate candidate models with a hard auxiliary mask. Furthermore, to enhance the stability and performance, we implement a multi-step training through iteratively training the auxiliary parameters on validation data and retraining the original weights, which attempts to solve the following bi-level optimization problem:

min_α L_val(Ṽ^α*, Θ*) + λ | ‖1_{α>0}‖_1 − s |,
s.t. (Ṽ^α*, Θ*) = arg min_{Ṽ^α, Θ} L_train(Ṽ^α, Θ).

Early Stopper for the Search Stage. We can determine to stop the search stage if the number of positive auxiliary weights is close to the target size with no significant gain in validation AUC, since the sign of the auxiliary weights exactly indicates whether the corresponding embedding components are pruned or not. On the contrary, it is hard to deploy the early stopper for the search stage using other auxiliary mask pruning methods, because the value of validation AUC during the search epochs is unable to represent the performance of the model selected in the retraining stage.

Retrain. After training {α, V, Θ} for several epochs in the search step, we obtain a binary mask based on the sign of α and continue optimizing {V, Θ} till convergence. The early stopper is deployed for both the pretraining and retraining steps, which terminates training if model performance on validation data cannot be improved within a few epochs. The overall framework is described in Algorithm 1.

Algorithm 1: Structural HAM Pruning
Input: data D_train, D_val, D_test; base dimensions d_k (≤ C_k) for each field k; target embedding size s; hyperparameters λ, ε > 0.
• Pretrain:
    while stopping criteria not met do
        obtain a batch of samples from D_train and update V, Θ regularized with SO (1) by a certain optimizer;
    end
• Search: initialize α = ε · 1;
    while stopping criteria not met do
        obtain a batch of samples from D_val and update α by Eq. (3);
        obtain a batch of samples from D_train and update V, Θ by a certain optimizer;
    end
• Retrain: mask the dimensions with 0 where α < 0 and retrain the model until the stopping criteria are met.
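As a schematic of one search iteration (the α update of Eq. (3) followed by a model update, cf. Algorithm 1), consider the sketch below. It assumes the hypothetical HAMEmbedding module from the previous sketch, a model that calls it internally, and standard PyTorch training objects; it is not the authors' implementation, and the function name search_step and its argument layout are illustrative.

```python
import torch


def search_step(model, emb, val_batch, train_batch, criterion,
                model_opt, target_s, eta=1e-3, lam=5e-5):
    """One search iteration: update alpha on validation data (Eq. (3)),
    then update V and Theta on training data. Sketch only.
    `model` is assumed to use `emb` internally; `model_opt` should cover
    V and Theta but not emb.alpha."""
    # --- alpha update on a validation batch (Eq. (3)) ---
    x_val, y_val = val_batch
    loss = criterion(model(x_val), y_val)
    grads = torch.autograd.grad(loss, list(emb.alpha))  # STE gradient w.r.t. alpha (Eq. (2))
    with torch.no_grad():
        n_active = sum(int((a > 0).sum()) for a in emb.alpha)
        drift = eta * lam * ((n_active > target_s) - (n_active < target_s))
        for a, g in zip(emb.alpha, grads):
            a -= eta * g + drift                        # gradient step + sign penalty

    # --- model update (V, Theta) on a training batch ---
    x_tr, y_tr = train_batch
    model_opt.zero_grad()
    criterion(model(x_tr), y_tr).backward()
    model_opt.step()
```

The paper optimizes α with plain SGD; writing the update out by hand as above simply makes the extra sign-penalty term of Eq. (3) explicit.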
3.4. On the Gradient Dynamics

As finding the optimal mask is essentially a combinatorial optimization problem over the set of 2^n possible statuses of on-off switches (n is the number of switches), the proposed gradient-based search rule given in (3) provides an alternative approach to look through the discrete candidate sets in a continuous space. We elaborate below that the STE-based gradient in (2) is related to the Taylor approximation for the Leave-One-Out error and that the penalty term drifts the auxiliary variables to find a model of the target size.

For illustration purposes, we start with the dynamics of the scalar-valued parameter α_{k,j}, the j-th element of α_k, that controls the mask for the j-th column vector V_{k,j} of the embedding table V_k. Define the function ℓ(c) := ℓ[c·V_{k,j}] as the value of the loss function when V_{k,j} is replaced by c·V_{k,j} (0 ≤ c ≤ 1). By the first-order Taylor approximation, we have ℓ(1) − ℓ(0) ≈ ℓ'(0) and ℓ(0) − ℓ(1) ≈ −ℓ'(1). Moreover, it is worth noting that the STE-based gradient in (2) with regard to α_{k,j} is exactly ℓ'(1_{α_{k,j}>0}), i.e.,

∂L_batch / ∂1_{α_{k,j}>0} = ℓ'(1_{α_{k,j}>0}) ≈ ℓ(1) − ℓ(0) = ℓ[V_{k,j}] − ℓ[0·V_{k,j}].   (4)

In other words, the proposed gradient calculates the importance (with batches of data) of the embedding component V_{k,j} using the first-order Taylor approximation. The above heuristics are also mentioned in [36, 37, 25].

We now consider the dynamics of all the auxiliary parameters α_{k,j} provided by (3). As illustrated above, the term ∇_{1_{α_t>0}} L_data measures the importance of each embedding component by approximating the difference between the loss value with un-masked embeddings and the loss value with a mask on. Furthermore, the sign of the penalty term in (3) is determined by the comparison between the number of un-masked components and the target size s. As an analogue, the dynamics of the auxiliary parameters can be viewed as a particle system in a neighborhood of 0 on the real line, in which the particles are initialized at the same point ε > 0 (i.e., the model starts with all embeddings un-masked) and the velocity of each particle is roughly η·(ℓ[0·V_{k,j}] − ℓ[V_{k,j}]) by the approximation in (4). In addition, an external force is introduced by the penalty term λ|‖1_{α>0}‖_1 − s|, which can be interpreted as a wind flow whose velocity is −ηλ < 0 when the number of positive particles is larger than s and ηλ > 0 otherwise. As a result, when the number of positive particles exceeds s, those particles with smaller ℓ[0·V_{k,j}] − ℓ[V_{k,j}] tend to become negative, and vice versa; see Figure 2 for further illustration of the proposed algorithm.

Figure 2: An intuitive viewpoint of the gradient dynamics. Each particle α_{k,j} drifts with velocity ≈ η·(ℓ[0·V_{k,j}] − ℓ[V_{k,j}]), while the external force pushes all particles downward when the number of positive α's exceeds s and upward otherwise, moving them between the un-masked (α > 0) and masked (α ≤ 0) states.
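The particle-system picture can be reproduced with a small synthetic simulation: fix a made-up importance score per column as a stand-in for ℓ[0·V_{k,j}] − ℓ[V_{k,j}], apply the drift plus the external force from Eq. (3), and count how many α's stay positive. All numbers below are invented purely to illustrate the dynamics under these assumptions; they are not results from the paper.

```python
import torch

torch.manual_seed(0)
n, s = 20, 8                        # 20 embedding columns, target size s = 8
eta, lam, eps = 1e-2, 2.0, 0.01     # lam chosen large so the external force dominates
importance = torch.rand(n) * 2 - 1  # synthetic stand-in for l[0*V_kj] - l[V_kj], in [-1, 1]

alpha = torch.full((n,), eps)       # all columns start un-masked
for t in range(3000):
    active = int((alpha > 0).sum())
    drift = eta * importance                            # velocity from the loss term
    force = -eta * lam * ((active > s) - (active < s))  # external force ("wind")
    alpha = alpha + drift + force

kept = set(torch.nonzero(alpha > 0).flatten().tolist())
top_s = set(importance.topk(s).indices.tolist())
# roughly s columns tend to survive, largely the ones with the highest importance
print(len(kept), len(kept & top_s))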
4. Experiments

To validate the performance of our proposed framework, we conduct extensive experiments on three real-world recommendation datasets. Through the experiments, we seek to answer the following three questions: (i) How does our proposed method, HAM, perform compared with other auxiliary mask pruning (AMP) methods? (ii) How does our proposed framework perform compared with the existing field-wise embedding size search methods in the literature in terms of prediction ability and memory consumption? (iii) How does the soft orthogonal regularity boost the performance of our proposed framework? In the following, we first introduce the experimental setups, including datasets, baselines, and implementation details, and then present results as well as discussions.

4.1. Datasets and Data Preparation

We use three benchmark datasets in our experiments: (i) MovieLens-1M: this dataset contains users' ratings (1-5) on movies. We treat samples with ratings greater than 3 as positive samples and samples with ratings below 3 as negative samples; neutral samples with a rating equal to 3 are dropped. (ii) Criteo: this dataset has 45 million users' clicking records on displayed ads. It contains 26 categorical feature fields and 13 numerical feature fields. (iii) Avazu: this dataset has 40 million clicking records on displayed mobile ads. It has 22 feature fields spanning from user/device features to ad attributes.

We follow the common approach to remove infrequent features and discretize numerical values. First, we remove the infrequent feature values and treat them as a single value "unknown", where the threshold is set to {10, 4} for Criteo and Avazu, respectively. Second, we convert all numerical values into categorical values by transforming a value z to int(log(z)^2) if int(z) > 2 and to int(z) − 2 otherwise, which is proposed by the winner of the Criteo Competition (https://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf). Third, we randomly split all training samples into 80% for training, 10% for validation, and 10% for testing.

4.2. Baselines

We compare our proposed pruning method HAM with several baseline methods below.

• Uniform: this method assigns a uniform embedding size to all feature fields.

• Auxiliary Mask Pruning: we compare our proposed method with other common approaches that apply an auxiliary mask to the embeddings. To be specific, we select three representative types of auxiliary mask listed in Table 1 and apply the same one-shot NAS framework as described in Algorithm 1 (see the sketch after this list): (i) SAM: the embeddings are masked directly by auxiliary weights 0 ≤ α ≤ 1, which is equivalent to DNIS [1]; (ii) SAM-GS: the auxiliary mask is obtained by the Gumbel-sigmoid function of auxiliary parameters α ∈ (0, 1) and u ~ Uniform(0, 1) below: Sigmoid[(log(α/(1 − α)) + log(u/(1 − u)))/τ]; (iii) HAM-p: the mask is generated by Bernoulli random variables with parameters p when forwarding the network, and the gradient for p is calculated by STE. In the retraining stage, we keep the embedding components with top-s auxiliary weights for fair comparisons.

• AutoDim [2]: the state-of-the-art NAS-based method in the literature, which outperforms a list of search methods [8, 10, 5, 7]; see Section 3.4 in [2].
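The sketch referred to in the Auxiliary Mask Pruning item above: the four mask constructions compared in this section can each be written in a few lines. This is our paraphrase of the formulas in Table 1 and the bullet list, not code from the paper, and the straight-through expressions are one standard way to realize the STE.

```python
import torch


def sam_mask(alpha):
    """SAM: soft auxiliary mask; embeddings are scaled directly by weights in [0, 1]
    (clamping is one simple way to keep alpha inside the interval)."""
    return alpha.clamp(0.0, 1.0)


def sam_gs_mask(alpha, tau=0.1):
    """SAM-GS: Gumbel-sigmoid mask, Sigmoid[(log(a/(1-a)) + log(u/(1-u))) / tau]."""
    u = torch.rand_like(alpha)
    return torch.sigmoid((torch.log(alpha / (1 - alpha))
                          + torch.log(u / (1 - u))) / tau)


def ham_p_mask(p):
    """HAM-p: Bernoulli sample in the forward pass, straight-through gradient to p."""
    sample = torch.bernoulli(p)
    return sample + p - p.detach()


def ham_mask(alpha):
    """HAM (ours): deterministic indicator 1_{alpha > 0} with straight-through gradient."""
    return (alpha > 0).float() + alpha - alpha.detach()
```

In each case the returned mask multiplies the embedding vector element-wise, as in the earlier HAMEmbedding sketch; only the hard deterministic mask guarantees that the model evaluated during search equals the model selected for retraining.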
4.3. Implementation Details

Base Model Architecture. We adopt four representative CTR-prediction models from the literature, FM [26], DeepFM [27], AutoInt [28], and DCN-V2 [31], as our base models to compare their performance. For FM, DeepFM, and AutoInt, we add an extra embedding vector layer between the feature interaction layer and the embedding layer, as proposed in [10]: linear transformations (without bias) are applied to the mixed-dimension embeddings so that the transformed embedding vectors are of the same size, which is set to 16 in our experiments. This is not required for DCN-V2 due to its bit-wise crossing mechanism. In the pretraining stage, the base dimension d_k for each feature field k is chosen as min(16, C_k), where C_k is the field dimension.

Optimization. We follow the common setup in the literature and employ the Adam optimizer with a learning rate of 10^-3 to optimize the model parameters in all three stages, and use the SGD optimizer to optimize the auxiliary weights in the search stage. To fairly compare different AMP methods, the number of search epochs is fixed at 10, the learning rates of the auxiliary parameters are chosen from the search grid {1, 10^-1, 10^-2, 10^-3}, and the temperature of the sigmoid function is chosen from the search grid {1, 10^-1, 10^-2}. To obtain the best results of each method, we pick 10^-2 as the SGD learning rate for SAM, SAM-GS, and HAM-p, and 10^-3 for our method HAM. In HAM, the initial value of the auxiliary weights is ε = 0.01 and the hyperparameter λ = 5 × 10^-5. For the orthogonal regularity, we employ νR(V) with ν = 10^-3 for training models on MovieLens-1M, and use the pair-wise cosine similarities νR(V̄) with ν = 10^-6 on Avazu. We use PyTorch to implement our method and train it with mini-batch size 2048 on a single 16GB-memory Nvidia Tesla V100.

4.4. Performance Comparison

We next present detailed comparisons with other methods. We adopt AUC (Area Under the ROC Curve) on the test datasets to measure the performance of models and the number of embedding parameters to measure memory usage.

Comparison with other AMP Methods. We first compare our proposed AMP method to the other common AMP methods listed in Table 1 on MovieLens-1M. To fairly compare the performance, we not only consider the test AUC under the same target total embedding size s but also take the number of model parameters into account. In Figure 3, we report the average values of test AUC under the same total embedding size and the best results among 10 independent runs in terms of AUC, and plot the Test AUC - # Parameter curve on MovieLens-1M. For each method, three measurements are provided from left to right with the target total embedding size s = 14, 28, 42 respectively. We drop the Test AUC - # Parameter curve of the method with the worst performance at the bottom.

Figure 3: Performance comparison of different auxiliary mask pruning methods (Uniform, SAM, SAM-GS, HAM-p, HAM) on the four base models FM, DeepFM, AutoInt, and DCN-V2 on MovieLens-1M. For each method, three measurements from left to right are reported with the target total embedding size s = 14, 28, 42 respectively.

We observe that: (a) fixing the target total embedding size, our method HAM outperforms all other methods on the four base models with regard to the value of test AUC. As illustrated in the Test AUC - # Parameter curves, with the same budget of total embedding size, HAM tends to assign larger sizes to informative feature fields with larger vocabulary sizes compared with other methods. This should be attributed to the gradient dynamics described in Section 3.4, which help to identify the important embedding components; as a result, HAM obtains models with higher AUC under the same total embedding size. (b) It is interesting to observe that SAM also assigns larger embedding sizes to relatively un-informative features compared to other methods with the same total embedding size, which verifies the findings in [12] that the magnitudes of auxiliary weights may not indicate importance. (c) HAM exhibits its superiority not only on the vector-wise feature crossing models (FM, DeepFM, AutoInt), but also on the bit-wise feature crossing model (DCN-V2), where Uniform serves as a strong baseline. The performance of HAM should be credited to the hard selection induced by the indicator functions: with the deterministic binary mask, the masked model evaluated in the search stage is equivalent to the selected model in the retraining stage, as illustrated in Table 1. As a result, we conclude that HAM exhibits stable performance on all four base models, while SAM, SAM-GS, and HAM-p suffer from instability and suboptimality in some circumstances.

Comparison with AutoDim. We also compare our method with the state-of-the-art NAS-based method AutoDim, with {2, 4, 8} as the candidate embedding sizes, on the Criteo and Avazu datasets in Figure 4. In our method, the total embedding size s is set to 90 and 50 for Criteo and Avazu, respectively. Our method outperforms AutoDim by finding models with comparably higher AUC and smaller sizes.

Figure 4: Performance comparison (Test AUC vs. number of parameters) between HAM and AutoDim on Criteo and Avazu; markers distinguish the base models FM, DeepFM, AutoInt, and DCN-V2.

In addition, from a computational perspective, our approach HAM has several advantages over AutoDim: (a) HAM has a larger search space due to the structural masking mechanism. For each feature, its candidate embedding size ranges from 0 to its base dimension d_k; on the contrary, AutoDim requires manually pre-specifying a set of candidate embedding sizes for each feature. (b) HAM is more computationally efficient. We emphasize that AutoDim introduces extra space and time complexity through the operations of lifting, batch normalization, and aggregation for each candidate size, while HAM only requires an extra element-wise product between the binary mask and the embedding vectors. Moreover, HAM can output multiple models of different total embedding sizes given the same pre-trained model, whereas AutoDim requires pretraining the model once more if the candidate embedding sizes are changed.

Figure 5: Cosine similarities between column vectors from the embedding table of genre when pretraining DCN-V2 on MovieLens-1M with and without SO (panels "Pretrain with SO" and "Pretrain without SO").

Table 2: The test AUC gain by Algorithm 1 with SO under different target embedding sizes on MovieLens-1M and Avazu.
Data | Size s | FM | DeepFM | AutoInt | DCN-V2
ML-1M | 14 | +.0006 | +.0029 | +.0013 | +.0012
ML-1M | 28 | +.0018 | +.0034 | +.0018 | +.0027
ML-1M | 42 | +.0036 | +.0033 | +.0009 | +.0037
Avazu | 44 | +.0034 | +.0039 | +.0026 | +.0031
Avazu | 66 | +.0039 | +.0044 | +.0020 | +.0033
The results are obtained by averaging 10 runs; bold font in the original table indicates statistical significance under two-sided t-tests (p < 0.05).

On the Orthogonal Regularity.
We further analyze 5. Conclusions the effect of the orthogonal regularity proposed in our framework. In Figure 5, we visualize the cosine simi- In this paper, we propose a general auxiliary masking larities between column vectors from the embedding of pruning framework to search the embedding sizes for dif- genre when pretraining DCN-V2 on MovieLens-1M with ferent feature fields adaptively in todayβs recommender and without SO where the embedding size is set as 8. It systems. The proposed method leverages the indicator is clear to see that SO helps to reduce the correlations function to search candidate models exactly, which is within those embedding column vectors. Moreover, we efficient and can be easily applied to various models. compare the performance of models selected by our pro- Extensive experiments demonstrate it can effectively re- posed Algorithm 1 with the one without SO in Table 2. move redundant embedding dimensions without great Statistical significant gains in test AUC can be observed performance loss. for MovieLens-1M and Avazu on all base models while the training time per epoch with SO only increases by about 1.6 times compared to the time without SO. Acknowledgments The authors are grateful to anonymous reviewers for 4.5. Discussions and Future Work their constructive comments that greatly improved the presentation of this paper. T. Xiao thanks his Ph.D. advi- Multi-stage Search. As the initial base dimension sor Dr. Krishnakumar Balasubramanian for supports and for each feature fields it tunable, we observe a gain, helpful discussions during the internship at ByteDance. βΌ0.3%, in test AUC from preliminary experiments on MovieLens-1M by searching via HAM with a smaller ini- tial size = 8. This observation confirms the claim made References in recent works [38] that enlarging search space is un- beneficial or even detrimental to existing NAS methods. [1] W. Cheng, Y. Shen, L. Huang, Differentiable neu- To address this issue, our method can also be adapted to ral input search for recommender systems, arXiv design a multi-stage search strategy with a sufficiently preprint arXiv:2006.04466 (2020). larger initial size and stage-wisely prune embeddings [2] X. Zhao, H. Liu, H. Liu, J. Tang, W. Guo, J. Shi, with decreasing target embedding size π . S. Wang, H. Gao, B. Long, Memory-efficient em- Comparison with PEP. To justify the advantages bedding for recommendations, arXiv preprint of our proposed framework, we also discuss the most arXiv:2006.14827 (2020). recent Plug-in Embedding Pruning (PEP) [14] method. [3] S. Zhang, L. Yao, A. Sun, Y. Tay, Deep learning As introduced before, PEP is an unstructured pruning based recommender system: A survey and new method that retains a large number of zero parameters perspectives, ACM Computing Surveys (CSUR) 52 and requires the technique of sparse matrix storage. On (2019) 1β38. the contrary, our method can not only prune the em- [4] P. Covington, J. Adams, E. Sargin, Deep neural net- bedding columns explicitly but also reduce the parame- works for youtube recommendations, in: Proceed- ters in the dense layers while PEP cannot. For example, ings of the 10th ACM conference on recommender HAM (π = 28) can obtain a lighter DCN-V2 model on systems, 2016, pp. 191β198. MovienLens-1M with only βΌ8% number of parameters [5] X. Zhao, C. Wang, M. Chen, X. Zheng, X. Liu, (3,281/40,241) in the dense layers comparing to the model J. Tang, Autoemb: Automated embedding dimen- with a uniform embedding size 16. 
sionality search in streaming recommendations, Comparison with SSEDS. During the revisions of arXiv preprint arXiv:2002.11252 (2020). this paper, we noticed that a concurrent work [39] also [6] H. Liu, X. Zhao, C. Wang, X. Liu, J. Tang, Auto- proposes to prune embeddings column-wisely as well mated embedding size search in deep recommender as row-wisely using indicator functions. Our work is systems, in: Proceedings of the 43rd International different from theirs in two aspects: (i) we focus on field- ACM SIGIR Conference on Research and Develop- wise search by pruning nearly orthogonal embedding col- ment in Information Retrieval, 2020, pp. 2307β2316. umn vectors; (ii) we use a gradient-descent-based method [7] M. R. Joglekar, C. Li, M. Chen, T. Xu, X. Wang, in the search stage to solve the bilevel problem on the J. K. Adams, P. Khaitan, J. Liu, Q. V. Le, Neural in- training and validation set; the gradient dynamics enable put search for large scale recommendation models, us to re-activate (un-mask) masked embeddings while in: Proceedings of the 26th ACM SIGKDD Interna- SSEDS directly mask all components with small gradient tional Conference on Knowledge Discovery & Data magnitudes. Comparing our method with SSEDS and Mining, 2020, pp. 2387β2397. incorporating our method into SSEDS is our future work. [8] T. Chen, L. Li, Y. Sun, Differentiable product quan- tization for end-to-end embedding compression, in: International Conference on Machine Learning, or propagating gradients through stochastic neu- PMLR, 2020, pp. 1617β1626. rons for conditional computation, arXiv preprint [9] W.-C. Kang, D. Z. Cheng, T. Chen, X. Yi, D. Lin, arXiv:1308.3432 (2013). L. Hong, E. H. Chi, Learning multi-granular quan- [23] E. Jang, S. Gu, B. Poole, Categorical reparameteri- tized embeddings for large-vocab categorical fea- zation with gumbel-softmax, in: International Con- tures in recommender systems, in: Companion ference on Learning Representations, 2016. Proceedings of the Web Conference 2020, 2020, pp. [24] S. Srinivas, A. Subramanya, R. Venkatesh Babu, 562β566. Training sparse neural networks, in: Proceedings [10] A. A. Ginart, M. Naumov, D. Mudigere, J. Yang, of the IEEE conference on computer vision and pat- J. Zou, Mixed dimension embeddings with applica- tern recognition workshops, 2017, pp. 138β145. tion to memory-efficient recommendation systems, [25] M. Ye, D. Choudhary, J. Yu, E. Wen, Z. Chen, J. Yang, in: 2021 IEEE International Symposium on Infor- J. Park, Q. Liu, A. Kejariwal, Adaptive dense-to- mation Theory (ISIT), IEEE, 2021, pp. 2786β2791. sparse paradigm for pruning online recommen- [11] H. Liu, K. Simonyan, Y. Yang, Darts: Differentiable dation system with non-stationary data, arXiv architecture search, in: International Conference preprint arXiv:2010.08655 (2020). on Learning Representations, 2018. [26] S. Rendle, Factorization machines, in: 2010 IEEE In- [12] R. Wang, M. Cheng, X. Chen, X. Tang, C.-J. Hsieh, ternational conference on data mining, IEEE, 2010, Rethinking architecture selection in differentiable pp. 995β1000. nas, arXiv preprint arXiv:2108.04392 (2021). [27] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, Deepfm: a [13] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, factorization-machine based neural network for ctr Q. Le, Understanding and simplifying one-shot prediction, in: Proceedings of the 26th International architecture search, in: International Conference Joint Conference on Artificial Intelligence, 2017, pp. on Machine Learning, PMLR, 2018, pp. 550β559. 1725β1731. 
[14] S. Liu, C. Gao, Y. Chen, D. Jin, Y. Li, Learnable em- [28] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, bedding sizes for recommender systems, in: Inter- J. Tang, Autoint: Automatic feature interaction national Conference on Learning Representations, learning via self-attentive neural networks, in: Pro- 2020. ceedings of the 28th ACM International Conference [15] D. Blalock, J. J. G. Ortiz, J. Frankle, J. Guttag, What is on Information and Knowledge Management, 2019, the state of neural network pruning?, arXiv preprint pp. 1161β1170. arXiv:2003.03033 (2020). [29] R. Wang, B. Fu, G. Fu, M. Wang, Deep & cross [16] P. Savarese, H. Silva, M. Maire, Winning the lottery network for ad click predictions, in: Proceedings with continuous sparsification, in: Advances in of the ADKDDβ17, 2017, pp. 1β7. Neural Information Processing Systems, volume 33, [30] X. He, T.-S. Chua, Neural factorization machines for Curran Associates, Inc., 2020, pp. 11380β11390. sparse predictive analytics, in: Proceedings of the [17] C. Louizos, M. Welling, D. P. Kingma, Learning 40th International ACM SIGIR conference on Re- sparse neural networks through l_0 regularization, search and Development in Information Retrieval, in: International Conference on Learning Represen- 2017, pp. 355β364. tations, 2018. [31] R. Wang, R. Shivanna, D. Cheng, S. Jain, D. Lin, [18] S. Xie, H. Zheng, C. Liu, L. Lin, Snas: stochastic L. Hong, E. Chi, Dcn v2: Improved deep & cross neural architecture search, in: International Con- network and practical lessons for web-scale learn- ference on Learning Representations, 2018. ing to rank systems, in: Proceedings of the Web [19] Y. Guo, A. Yao, Y. Chen, Dynamic network surgery Conference 2021, 2021, pp. 1785β1797. for efficient dnns, in: Proceedings of the 30th In- [32] N. Bansal, X. Chen, Z. Wang, Can we gain more ternational Conference on Neural Information Pro- from orthogonality regularizations in training deep cessing Systems, 2016, pp. 1387β1395. networks?, Advances in Neural Information Pro- [20] X. Xiao, Z. Wang, Autoprune: Automatic network cessing Systems 31 (2018) 4261β4271. pruning by regularizing auxiliary parameters, Ad- [33] P. RodrΓguez, J. Gonzalez, G. Cucurull, J. M. Gonfaus, vances in Neural Information Processing Systems X. Roca, Regularizing cnns with locally constrained 32 (NeurIPS 2019) 32 (2019). decorrelations, arXiv preprint arXiv:1611.01967 [21] X. Zhou, W. Zhang, H. Xu, T. Zhang, Effective spar- (2016). sification of neural networks with global sparsity [34] S. Yan, Y. Zheng, W. Ao, X. Zeng, M. Zhang, Does constraint, in: Proceedings of the IEEE/CVF Confer- unsupervised architecture representation learning ence on Computer Vision and Pattern Recognition, help neural architecture search?, Advances in Neu- 2021, pp. 3599β3608. ral Information Processing Systems 33 (2020) 12486β [22] Y. Bengio, N. LΓ©onard, A. Courville, Estimating 12498. [35] H. Hazimeh, Z. Zhao, A. Chowdhery, M. Sathi- amoorthy, Y. Chen, R. Mazumder, L. Hong, E. H. Chi, Dselect-k: Differentiable selection in the mixture of experts with applications to multi-task learning, arXiv preprint arXiv:2106.03760 (2021). [36] P. Molchanov, S. Tyree, T. Karras, T. Aila, J. Kautz, Pruning convolutional neural networks for resource efficient inference, arXiv preprint arXiv:1611.06440 (2016). [37] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, J. Kautz, Importance estimation for neural network pruning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 
11264β11272. [38] Y. Ci, C. Lin, M. Sun, B. Chen, H. Zhang, W. Ouyang, Evolving search space for neural architecture search, in: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, 2021, pp. 6659β6669. [39] L. Qu, Y. Ye, N. Tang, L. Zhang, Y. Shi, H. Yin, Single- shot embedding dimension search in recommender system, arXiv preprint arXiv:2204.03281 (2022).