<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Field-wise Embedding Size Search via Structural Hard Auxiliary Mask Pruning for Click-Through Rate Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tesi Xiao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xia Xiao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ming Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Youlong Cheng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ByteDance, Mountain View</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California</institution>
          ,
          <addr-line>Davis</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Feature embeddings are one of the most important components of deep learning based Click-Through Rate (CTR) prediction models; they map high-dimensional sparse features to dense embedding vectors. Classic human-crafted embedding size selection methods are shown to be “sub-optimal” in terms of the trade-off between memory usage and model capacity. The trending methods in Neural Architecture Search (NAS) have demonstrated their efficiency in searching for embedding sizes. However, most existing NAS-based approaches suffer from expensive computational costs, the curse of dimensionality of the search space, and the discrepancy between the continuous search space and the discrete candidate space. Other methods that prune embeddings in an unstructured manner fail to explicitly reduce the computational costs. In this paper, to address those limitations, we propose a novel strategy that searches for the optimal mixed-dimension embedding scheme by structurally pruning a super-net via a Hard Auxiliary Mask. Our method searches candidate models directly in the discrete space using a simple and efficient gradient-based method. Furthermore, we introduce orthogonal regularity on embedding tables to reduce correlations within embedding columns and enhance representation capacity. Extensive experiments demonstrate that it can effectively remove redundant embedding dimensions without great performance loss.</p>
      </abstract>
      <kwd-group>
        <kwd>CTR Prediction</kwd>
        <kwd>Embedding Size</kwd>
        <kwd>Network Pruning</kwd>
        <kwd>Neural Architecture Search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Deep learning based recommender systems (DLRS) have demonstrated superior performance over more traditional recommendation techniques [<xref ref-type="bibr" rid="ref3">3</xref>]. The success of DLRS is mainly attributed to their ability to learn meaningful representations from categorical features, which subsequently helps with modeling the non-linear user-item relationships efficiently. Indeed, real-world recommendation tasks usually involve a large number of categorical feature fields with high cardinality (i.e., the number of unique values, or vocabulary size) [<xref ref-type="bibr" rid="ref4">4</xref>]. One-Hot Encoding is a standard way to represent such categorical features.</p>
      <p>To reduce the memory cost of One-Hot Encoding, DLRS first map the high-dimensional one-hot vectors into real-valued dense vectors via the embedding layer. Such embedded vectors are subsequently used in predictive models for obtaining the required recommendations.</p>
      <p>In this pipeline, the choice of the dimension of the embedding vectors, also known as the embedding dimension, plays a crucial role in the overall performance of the DLRS. Most existing models assign a fixed and uniform embedding dimension to all features, either due to the prerequisites of the model input or simply for the sake of convenience. If the embedding dimensions are uniformly high, memory usage and computational cost increase, since a uniform dimension fails to handle the heterogeneity among different features. As a concrete example, encoding features with few unique values, such as gender, with large embedding vectors leads to over-parametrization. Conversely, the selected embedding size may be insufficient for highly predictive features with large cardinalities, such as the user's last search query. Therefore, finding appropriate embedding dimensions for different feature fields is essential.</p>
      <p>The existing works towards automating the embedding size search can be categorized into two groups: (i) field-wise search [<xref ref-type="bibr" rid="ref2 ref5 ref6">5, 2, 6</xref>]; (ii) vocabulary-wise search [<xref ref-type="bibr" rid="ref7 ref8">7, 8, 9, 10</xref>]. The former group aims to assign different embedding sizes to different feature fields, while embeddings in the same feature field share the same dimension. The latter further attempts to assign different embedding sizes to different feature values within the same feature field, generally based on the frequencies of the feature values. Although it has been shown that the latter group can significantly reduce the model size without a great performance drop, it suffers from several challenges and drawbacks (we refer readers to Section 3.4 in [<xref ref-type="bibr" rid="ref2">2</xref>] for details): (a) the large number of unique values in each feature field leads to a huge search space in which the optimal solution is difficult to find; (b) the feature value frequencies are time-varying and not pre-known in real-time recommender systems; (c) it is difficult to handle embeddings with different dimensions for the same feature field in state-of-the-art DLRS due to the feature crossing mechanism. As a result, we fix our attention on the field-wise embedding size search in this work.</p>
      <p>The majority of recent works towards automating the embedding size search are based on the trending ideas in Neural Architecture Search (NAS). NAS has been an active research area recently; it is composed of three elements: (i) a search space $\mathcal{S}$ of all candidate models; (ii) a search strategy that goes through the models in $\mathcal{S}$; (iii) a performance estimation strategy that evaluates the performance of the selected model. To leverage NAS techniques to search embedding sizes for each feature field, Joglekar et al. [<xref ref-type="bibr" rid="ref7">7</xref>] formulate the search space by discretizing an embedding matrix into several sub-matrices and take a Reinforcement Learning (RL) approach with the Trainer-Controller framework for search and performance estimation, in which the controller samples different embedding blocks to maximize the reward observed from the trainer that trains the model with the controller's choices. Liu et al. [<xref ref-type="bibr" rid="ref6">6</xref>] also adopt the RL-based search, which starts with small embedding sizes and takes actions to keep or enlarge the size based on the prediction error. Their proposed method requires sequentially training the DLRS and the policy network repeatedly until convergence, which is computationally expensive.</p>
      <p>Given the rise of one-shot NAS, which aims to search and evaluate candidate models together in an over-parametrized supernet, Zhao et al. [<xref ref-type="bibr" rid="ref2 ref5">5, 2</xref>] follow the idea of Differentiable Neural Architecture Search (DARTS) [11] and solve a continuous bilevel optimization by a gradient-based algorithm. To be specific, they define a search space that includes all possible embedding sizes of each feature field; for instance, in [<xref ref-type="bibr" rid="ref2">2</xref>], 5 candidate sizes {2, 8, 16, 24, 32} are selected for each feature field. The embedding vectors of different sizes are then lifted into the same dimension with batch normalization so that they can be aggregated with the architecture weights of the candidate sizes and fed into the subsequent layers to obtain the prediction. Cheng et al. [<xref ref-type="bibr" rid="ref1">1</xref>] also leverage the idea of DARTS, but with the architecture weights being the weights of sub-blocks of the embedding matrices.</p>
      <p>Despite the obtained promising results, the existing methods have certain limitations which arise from the discrepancy between the large discrete candidate space and its continuous relaxation. On the one hand, the DARTS-based methods in [<xref ref-type="bibr" rid="ref1 ref2 ref5">5, 2, 1</xref>] are algorithmically efficient but search in an augmented continuous space; it has been shown that this discrepancy may lead to finding unsatisfactory candidates, because the magnitude of the architecture weights does not necessarily indicate how much the operation contributes to the overall performance [12]. On the other hand, while the RL-based methods in [<xref ref-type="bibr" rid="ref6 ref7">7, 6</xref>] and the hard selection variant in [<xref ref-type="bibr" rid="ref5">5</xref>] directly search and evaluate models in the original discrete candidate space, they are computationally heavy and thus not favored by large-scale DLRS. Therefore, a natural question follows: Is it possible to come up with an efficient method that searches straight through the discrete candidate space of all possible embedding sizes?</p>
      <p>In this work, we provide a positive answer by proposing an end-to-end field-aware embedding size search method leveraging the Straight-Through Estimator (STE) of gradients for non-differentiable functions. Motivated by one-shot NAS and network pruning, our method seeks to adaptively search for a good sub-network in a pre-trained supernet by masking redundant dimensions. The mask, named Hard Auxiliary Mask (HAM), is realized by the indicator function of auxiliary weights. Furthermore, to reduce the information redundancy and boost the search performance, we impose orthogonal regularity on the embedding tables and train models with regularized loss functions. The contributions and novelty of our work are: (i) we propose to prune embedding tables column-wise via hard auxiliary masks for CTR prediction models, which can effectively compress the model; (ii) we offer a gradient-based pruning method and delve into its dynamics with mathematical insights; (iii) we introduce orthogonal regularity for training recommendation models and experimentally show that orthogonal regularity helps to achieve significant improvements; (iv) we obtain state-of-the-art results on various modern models, and our method is scalable to large datasets.</p>
    </sec>
    <sec id="sec-rw">
      <title>2. Related Work</title>
      <sec id="sec-1-1">
        <title>Embedding Size Search for Recommender Systems.</title>
        <p>
          To deal with the heterogeneity among diferent features 3. Methodology
and reduce the memory cost of sizable embedding tables,
several recent works introduce a new paradigm of mixed- 3.1. Preliminaries
dimension embedding tables. In particular, Ginart et al.
[10] propose to substitute the uniform-dimension embed- Here we briefly introduce the mechanism of DLRS and
dings V ∈ R×  with V = EP, where E ∈ R×  is define the terms used throughout the paper.
the smaller embeddings with  ≪  and P ∈ R×  is the Model Architecture. Consider the data input
projection that lift the embeddings into the base dimen- involves  features fields from users, items, their
sion  for feature crossing purposes. A popularity-based interactions, and contextual information. We
derule is further designed to find the dimension  for each note these raw features by multiple one-hot vectors
feature field which is too rough and cannot handle impor- x1 ∈ R1 , . . . , x ∈ R , where field dimensions
tant features with low frequency. In addition, plenty of 1, . . .  are the cardinalities of feature fields 1.
Arworks seek to provide an end-to-end framework leverag- chitectures of DLRS often consist of three key
coming the advances in NAS, including RL approaches [
          <xref ref-type="bibr" rid="ref6 ref7">7, 6</xref>
          ] ponents: (i) embedding layers with tables V1 ∈
and diferentiable NAS methods [
          <xref ref-type="bibr" rid="ref1 ref2 ref5">5, 13, 2, 1</xref>
          ]. The most R1× 1 , . . . , V ∈ R ×  that map sparse one-hot
relevant work to us is [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which employs a soft auxiliary vectors to dense vectors in a low dimensional embedding
mask to prune the embeddings structurally. However, it space by v = V⊤x; (ii) feature interaction layers that
requires magnitude-based pruning to derive fine-grained model complex feature crossing using embedding
vecmixed embedding, in which the discrepancy between the tors; (iii) output layers that make final predictions for
continuous relaxed search space and discrete candidate specific recommendation tasks.
space could lead to a relatively great loss in performance. The feature crossing techniques of existing models
The most recent work [14] proposes a Plug-in Embed- fall into two types - vector-wise and bit-wise. Models
ding Pruning (PEP) approach to prune embedding tables with vector-wise crossing explicitly introduce
interacinto sparse matrices to reduce storage costs, Albeit sig- tions by the inner product, such as Factorization Machine
nificantly reducing the number of non-zero embedding (FM) [26], DeepFM [27] and AutoInt [28]. The bit-wise
parameters, this type of unstructured pruning method crossing, in contrast, implicitly adds interaction terms
fails to explicitly reduce the embedding size. by element-wise operations, such as the outer product
        </p>
        <p>Neural Network Pruning via Auxiliary Masks. To in Deep Cross Network (DCN) [29], and the Hadamard
deploy deep learning models on resource-limited devices, product in NFM [30] and DCN-V2 [31]. In this work, we
there is no lack of work in neural network pruning; see deploy our framework to the Click-Through Rate (CTR)
[15] and the reference therein. One popular approach prediction problem with four base models: FM, DeepFM,
towards solving this problem is to learn a pruning mask AutoInt, and DCN-V2.
through auxiliary parameters, which considers pruning Notations. Throughout this paper, we use ⊙ for
as an optimization problem that tends to minimize the the Hadamard (element-wise) product of two vectors.
supervised loss of the masked neural network with a The indicator function of a scalar-valued  is defined
certain sparsity constraint. As learning the optimal mask as 1&gt; 0 = 1 if  &gt; 0; 1&gt; 0 = 0 if  ≤ 0. The
is indeed a discrete optimization problem for binary
in</p>
        <sec id="sec-1-1-1">
          <title>1Numeric features are converted into categorical data by binning.</title>
          <p>indicator function of a vector  ∈ R is defined as
1 &gt;0 = [1 1&gt;0, . . . , 1 &gt;0]⊤. The identity matrix is
written as I and the function diag(· ) returns a diagonal
matrix with its diagonal entries as the input. We use
‖ · ‖ 1, ‖ · ‖  for the ℓ1 norm of vectors and the Frobenius
norm of matrices respectively.
probability score ˜ is given as follows:
˜ =  (v ⊙ 1 &gt;0|Θ) = (x| V˜ , Θ) = (x| , V, Θ),
where V˜ = {V˜ 1 1 , . . . , V˜  } is the pruned
embedding layer with
˜   = V diag(1 &gt;0),  = 1, . . . .</p>
          <p>V</p>
          <p>are pruned
We emphasize that embedding tables V˜ 
3.2. Background column-wisely in a structural manner unlike PEP [14]
In general, the CTR prediction models takes the concate- that prunes entry-wisely.
nation of all feature fields from a user-item pair, denoted Orthogonal Regularity Given the embedding table
V ∈ R ×  for feature field , its column vectors,
deby x = [x1; x2; . . . ; x ] as the input vector. Given the
embedding layer V = {V1, V2, . . . , V }, the feature noted by V,1, . . . , V, ∈ R , can be regarded as 
interaction layers take the embedding vectors v and feed diferent representations of feature  in the embedding
the learned hidden states into the output layer to obtain space. The auxiliary masks above aim to mask relatively
the prediction. The embedding vectors v are the dense uninformative column vectors so as to reduce the model
encoding of the one-hot input x that can be defined as size. Nonetheless, the presence of correlation between
follows: these vectors may complicate the selection procedure.</p>
          <p>
            Specifically, presuming that the most predictive one V,
v = [v1; v2; . . . ; v ] has been selected, it would be problematic if we greedily
= [︁V1⊤x1; V2⊤x2; . . . ; V⊤ x ]︁ := x, sderloepctwthheenneixntclcuodluedmninV.F,ortihnasttabnricneg,sifthVel,ar g̸⊥estVl o,ss,
where  is the embedding lookup operator. The predic- we have the following decomposition: V, = p + p⊥,
tion score ˆ is then calculated with models’ other param- where p = V, ‖ V, and p⊥ ⊥ p. Therefore, it
eters Θ in the feature interaction layers and output layer would be dificult to determine whether the increments
by ˆ =  (v|Θ) =  (x|Θ) = (x|V, Θ), where are attributed to the existing direction p or the new
facˆ ∈ [
            <xref ref-type="bibr" rid="ref1">0, 1</xref>
            ] is the predicted probability,  (·| Θ) is the pre- tor p⊥. To address this issue, we follow [32] to train
diction function of embedding vectors, and  =  ∘  is embedding parameters V with Soft Orthogonal (SO)
regthe prediction function of raw inputs. To learn the model ularizations:
tphaeratmraeint einrsg Vda,tΘa,,i.we.e, aim to minimize the Log-loss on ℛ(V) = ∑︁ ‖V⊤V − I‖2 /2 , (1)
mV,iΘn ℒtrain(V, Θ)
          </p>
          <p>1 ∑︁ [︀  log(ˆ ) + (1 −  ) log(1 − ˆ )︀]
:= −</p>
          <p>=1
where  is the total number of training samples.</p>
          <p>Hard Auxiliary Mask As illustrated in Figure 1, we
add auxiliary masks for each embedding dimension slot.</p>
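        <p>For concreteness, the following is a minimal PyTorch sketch of such a hard auxiliary mask with the identity straight-through estimator used in the search stage (Section 3.3); the module name, the initialization constant, and the detach-based STE trick are our illustrative choices, not code released with the paper.</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

class HardAuxiliaryMask(nn.Module):
    """Masks embedding columns with the exact indicator 1[alpha > 0].

    Forward pass: exact binary mask (hard selection).
    Backward pass: identity STE, so the gradient w.r.t. alpha equals
    the gradient w.r.t. the binary mask.
    """

    def __init__(self, dim, epsilon=0.01):
        super().__init__()
        # All columns start un-masked: alpha = epsilon * 1 with epsilon > 0.
        self.alpha = nn.Parameter(torch.full((dim,), epsilon))

    def forward(self, v):
        hard = (self.alpha > 0).float()  # exact binary mask 1[alpha > 0]
        # Value equals `hard`; gradient flows to alpha with factor 1.
        mask = hard.detach() + self.alpha - self.alpha.detach()
        return v * mask  # element-wise product with the embedding vector
]]></preformat>
        <p>Applying this module to a batch of embedding vectors $v_i$ of size $d_i$ reproduces $v_i \odot \mathbb{1}_{\alpha_i &gt; 0}$ in the forward pass while keeping $\alpha_i$ trainable.</p>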
        <p>Orthogonal Regularity. Given the embedding table $V_i \in \mathbb{R}^{n_i \times d_i}$ for feature field $i$, its column vectors, denoted by $V_{i,1}, \ldots, V_{i,d_i} \in \mathbb{R}^{n_i}$, can be regarded as $d_i$ different representations of feature $i$ in the embedding space. The auxiliary masks above aim to mask relatively uninformative column vectors so as to reduce the model size. Nonetheless, the presence of correlation between these vectors may complicate the selection procedure. Specifically, presuming that the most predictive column $V_{i,j}$ has been selected, it would be problematic to greedily decide whether the next column should be included or dropped. For instance, if $V_{i,k} \not\perp V_{i,j}$, we have the following decomposition: $V_{i,k} = p + p^{\perp}$, where $p$ is the projection of $V_{i,k}$ onto $V_{i,j}$ and $p^{\perp} \perp p$. Therefore, it would be difficult to determine whether the increments are attributed to the existing direction $p$ or to the new factor $p^{\perp}$. To address this issue, we follow [32] to train the embedding parameters $V$ with Soft Orthogonality (SO) regularization: $$\mathcal{R}(V) = \sum_{i=1}^{m} \| V_i^\top V_i - I \|_F^2 / d_i^2, \quad (1)$$ where the divisors $d_i^2$ are introduced to handle the heterogeneous dimensionality of the embedding tables. We also adopt a relaxed SO regularization in which $V_i$ is replaced by the normalized matrix $\bar{V}_i$ with unit column vectors, which corresponds to the pair-wise cosine similarities within the embedding columns [33].</p>
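        <p>A sketch of the SO penalty (1) and its cosine-similarity relaxation is given below; the function names and the numerical floor in the normalization are ours.</p>
        <preformat><![CDATA[
import torch

def soft_orthogonality(tables):
    """R(V) = sum_i ||V_i^T V_i - I||_F^2 / d_i^2 over embedding tables."""
    penalty = 0.0
    for V in tables:  # V has shape (n_i, d_i)
        d = V.shape[1]
        gram = V.t() @ V  # (d_i, d_i) Gram matrix of embedding columns
        eye = torch.eye(d, device=V.device)
        penalty = penalty + (gram - eye).pow(2).sum() / d ** 2
    return penalty

def relaxed_soft_orthogonality(tables):
    """Relaxed SO: the same penalty on column-normalized tables, i.e.
    on the pair-wise cosine similarities between embedding columns."""
    normed = [V / V.norm(dim=0, keepdim=True).clamp_min(1e-12) for V in tables]
    return soft_orthogonality(normed)
]]></preformat>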
      </sec>
      <sec id="sec-m-3">
        <title>3.3. Framework</title>
        <p>Our proposed framework is motivated by one-shot NAS and consists of three stages: pretrain, search, and retrain.</p>
        <p>Pretrain. As shown in [34], pre-trained architecture representations improve the downstream architecture search efficiency. Therefore, in the pretraining stage, we train the base model with a large embedding layer $V$. The base dimension $d_i$ for each feature field is determined by prior knowledge. In addition, the embedding dimension $d_i$ should not exceed the field dimension $n_i$ to avoid column-rank deficiency. The SO regularization term (1) is added to the mini-batch training loss for the optimizer to learn near-orthogonal embeddings. The learned model parameters, denoted by $V_{\mathrm{init}}, \Theta_{\mathrm{init}}$, are passed to the search stage as the initialization.</p>
        <p>Search. Provided with the pre-trained model, the goal of the search stage is to find the column-wise sparse embedding layer that preserves the model accuracy, which can be formulated as: $$\min_{\alpha} \min_{V, \Theta} \; \mathcal{L}_{\mathrm{train}}(\tilde{V}, \Theta) + \mu \, \big| \| \mathbb{1}_{\alpha &gt; 0} \|_1 - s \big|,$$ where $\| \mathbb{1}_{\alpha &gt; 0} \|_1$ counts the number of non-zero embedding columns and $s$ is the target number of non-zero columns. Note that, instead of directly regularizing $\| \mathbb{1}_{\alpha &gt; 0} \|_1$ as [16, 20, 25] do, we include the target number $s$ to reduce the instability arising from batched training and from the choice of the hyperparameter $\mu$. However, given that the objective function above is non-differentiable when $\alpha = 0$ and has a zero gradient anywhere else, traditional gradient descent methods are not applicable. To that end, we use the straight-through estimator (STE) [22], which replaces the ill-defined gradient in the chain rule with a fake gradient. Despite the various smooth alternatives used in the literature, such as the sigmoid [16] and piecewise polynomials [<xref ref-type="bibr" rid="ref9">35</xref>], we adopt the simplest identity function for backpropagation, i.e., $$\frac{\partial \mathcal{L}}{\partial \alpha} = \frac{\partial \mathcal{L}}{\partial \mathbb{1}_{\alpha &gt; 0}} \cdot \frac{\partial \mathbb{1}_{\alpha &gt; 0}}{\partial \alpha} \approx \frac{\partial \mathcal{L}}{\partial \mathbb{1}_{\alpha &gt; 0}} \cdot \frac{\partial \alpha}{\partial \alpha} = \frac{\partial \mathcal{L}}{\partial \mathbb{1}_{\alpha &gt; 0}}. \quad (2)$$ The search stage then starts with the unmasked model, i.e., $\alpha_0 = \epsilon \cdot \vec{1}$ for some small $\epsilon &gt; 0$. The gradient update rule for $\alpha$ at iteration $t$ is given by: $$\alpha_{t+1} = \alpha_t - \eta \cdot \nabla_{(\mathbb{1}_{\alpha_t &gt; 0})} \mathcal{L}_{\mathrm{batch}} - \eta \mu \cdot \mathrm{sign}(\| \mathbb{1}_{\alpha_t &gt; 0} \|_1 - s) \, \vec{1}, \quad (3)$$ where $\eta$ is the learning rate. We will illustrate in Section 3.4 that the above updates enjoy certain benefits and that the last term plays an important role by pushing the optimizer to iteratively evaluate candidate models with a hard auxiliary mask. Furthermore, to enhance stability and performance, we implement a multi-step training scheme that iteratively trains the auxiliary parameters on validation data and retrains the original weights on training data, which attempts to solve the following bilevel optimization problem: $$\min_{\alpha} \; \mathcal{L}_{\mathrm{val}}(\tilde{V}^{\star}, \Theta^{\star}) + \mu \, \big| \| \mathbb{1}_{\alpha &gt; 0} \|_1 - s \big| \quad \text{s.t.} \quad (V^{\star}, \Theta^{\star}) = \arg\min_{V, \Theta} \mathcal{L}_{\mathrm{train}}(\tilde{V}, \Theta).$$</p>
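        <p>Putting (2) and (3) together, one search-stage update of the auxiliary parameters can be sketched as follows; the mask modules are assumed to be instances of the HardAuxiliaryMask sketch from Section 3.2, and the explicit subgradient of the penalty replaces autograd for the non-differentiable absolute value.</p>
        <preformat><![CDATA[
import torch

def search_step(masks, val_loss, lr=1e-3, mu=5e-5, target_size=28):
    """One update of Eq. (3):
    alpha <- alpha - lr * grad_STE - lr * mu * sign(k - s) * 1."""
    for m in masks:
        m.alpha.grad = None
    val_loss.backward()  # STE gradients w.r.t. alpha, as in Eq. (2)
    with torch.no_grad():
        # k = current number of un-masked columns across all fields
        k = sum(int((m.alpha > 0).sum()) for m in masks)
        sign = (k > target_size) - (k < target_size)  # in {-1, 0, 1}
        for m in masks:
            m.alpha -= lr * (m.alpha.grad + mu * sign)
            m.alpha.grad = None
]]></preformat>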
        <p>Retrain. After training $\{\alpha, V, \Theta\}$ for several epochs in the search stage, we obtain a binary mask based on the sign of $\alpha$ and continue optimizing $\{V, \Theta\}$ till convergence. The early stopper is deployed for both the pretraining and retraining stages; it terminates training if the model performance on the validation data cannot be improved within a few epochs. The overall framework is described in Algorithm 1.</p>
        <p>Early Stopper for the Search Stage. We can determine to stop the search stage if the number of positive auxiliary weights is close to the target size with no significant gain in validation AUC, since the sign of an auxiliary weight exactly indicates whether the corresponding embedding component is pruned or not. On the contrary, it is hard to deploy an early stopper for the search stage with the other auxiliary mask pruning methods, because the validation AUC during the search epochs is unable to represent the performance of the model selected in the retraining stage.</p>
        <sec id="sec-m-3-1">
          <title>Algorithm 1: Structural HAM Pruning</title>
          <preformat>Input: data D_train, D_val, D_test; base dimensions d_i (&lt;= n_i) for each
       field i; target embedding size s; hyperparameters epsilon, mu &gt; 0.
◁ Pretrain:
while stopping criteria not met do
    obtain a batch of samples from D_train and update V, Theta
    regularized with SO (1) by a certain optimizer;
end
◁ Search: initialize alpha = epsilon * 1;
while stopping criteria not met do
    obtain a batch of samples from D_val and update alpha by Eq. (3);
    obtain a batch of samples from D_train and update V, Theta by a
    certain optimizer;
end
◁ Retrain: mask the dimensions with 0 where alpha &lt; 0 and retrain the
model until the stopping criteria are met.</preformat>
        </sec>
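        <p>As a companion to Algorithm 1, a condensed PyTorch-style training skeleton is sketched below; it reuses the soft_orthogonality and search_step sketches above, `model_params` is assumed to cover only $(V, \Theta)$ and not the auxiliary weights, and the early-stopping logic of the three stages is compressed into plain loops.</p>
        <preformat><![CDATA[
import torch
import torch.nn.functional as F

def run_ham(model, masks, model_params, train_loader, val_loader,
            lam=1e-3, search_epochs=10, lr_model=1e-3, lr_alpha=1e-3,
            mu=5e-5, s=28):
    opt = torch.optim.Adam(model_params, lr=lr_model)  # (V, Theta) only

    # Pretrain: log-loss + lam * R(V); early stopping omitted for brevity.
    for x, y in train_loader:
        loss = (F.binary_cross_entropy(model(x), y)
                + lam * soft_orthogonality(model.tables))
        opt.zero_grad(); loss.backward(); opt.step()

    # Search: alternate alpha updates on validation batches (Eq. (3))
    # with (V, Theta) updates on training batches.
    for _ in range(search_epochs):
        for (xv, yv), (xt, yt) in zip(val_loader, train_loader):
            search_step(masks, F.binary_cross_entropy(model(xv), yv),
                        lr_alpha, mu, s)
            loss = F.binary_cross_entropy(model(xt), yt)
            opt.zero_grad(); loss.backward(); opt.step()

    # Retrain: freeze the binary mask (columns with alpha < 0 stay pruned)
    # and continue optimizing (V, Theta) until early stopping.
    for m in masks:
        m.alpha.requires_grad_(False)
    for x, y in train_loader:
        loss = F.binary_cross_entropy(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
]]></preformat>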
      </sec>
      <sec id="sec-m-4">
        <title>3.4. On the Gradient Dynamics</title>
        <p>As finding the optimal mask is essentially a combinatorial optimization problem over the set of $2^K$ possible statuses of the on-off switches ($K$ is the number of switches), the proposed gradient-based search rule given in (3) provides an alternative approach to look through the discrete candidate sets in a continuous space. We elaborate below that the STE-based gradient in (2) is related to the Taylor approximation of the Leave-One-Out error, and that the penalty term drifts the auxiliary variables to find a model of the target size.</p>
        <p>For illustration purposes, we start with the dynamics of the scalar-valued parameter $\alpha_{i,j}$, the $j$-th element of $\alpha_i$, which controls the mask for the $j$-th column vector $V_{i,j}$ of the embedding table $V_i$. Define the function $\ell(u) := \mathcal{L}[u \cdot V_{i,j}]$ as the value of the loss function when replacing $V_{i,j}$ by $u \cdot V_{i,j}$ ($0 \le u \le 1$). By the first-order Taylor approximation, we have $\ell(1) - \ell(0) \approx \ell'(0)$ and $\ell(0) - \ell(1) \approx -\ell'(1)$. Moreover, it is worth noting that the STE-based gradient in (2) with regard to $\alpha_{i,j}$ is exactly $\ell'(\mathbb{1}_{\alpha_{i,j} &gt; 0})$, i.e., $$\frac{\partial \mathcal{L}}{\partial \mathbb{1}_{\alpha_{i,j} &gt; 0}} = \ell'(\mathbb{1}_{\alpha_{i,j} &gt; 0}) \approx \mathcal{L}[V_{i,j}] - \mathcal{L}[0 \cdot V_{i,j}]. \quad (4)$$ In other words, the proposed gradient calculates the importance (with batches of data) of the embedding component $V_{i,j}$ using the first-order Taylor approximation. The above heuristics are also mentioned in [<xref ref-type="bibr" rid="ref10 ref11">36, 37, 25</xref>].</p>
        <p>We now consider the dynamics of all the auxiliary parameters $\alpha_{i,j}$ provided by (3). As illustrated above, the term $\nabla_{(\mathbb{1}_{\alpha &gt; 0})} \mathcal{L}_{\mathrm{batch}}$ measures the importance of each embedding component by approximating the difference between the loss value with the embedding un-masked and the loss value with the mask on. Furthermore, the sign of the penalty term in (3) is determined by the comparison between the number of un-masked components and the target size $s$. As an analogue, the dynamics of the auxiliary parameters can be viewed as a particle system in a neighborhood of 0 on the real line, in which the particles are initialized at the same point $\epsilon &gt; 0$ (i.e., the model starts with all embeddings un-masked) and the velocity of each particle is roughly $\eta \cdot (\mathcal{L}[0 \cdot V_{i,j}] - \mathcal{L}[V_{i,j}])$ by the approximation in (4). In addition, an external force is introduced by the penalty term $\mu \, | \| \mathbb{1}_{\alpha &gt; 0} \|_1 - s |$, which can be interpreted as a wind flow whose velocity is $-\eta\mu &lt; 0$ when the number of positive particles is larger than $s$ and $\eta\mu &gt; 0$ otherwise. As a result, when the number of positive particles exceeds $s$, those particles with smaller $\mathcal{L}[0 \cdot V_{i,j}] - \mathcal{L}[V_{i,j}]$ tend to become negative, and vice versa; see Figure 2 for further illustration of the proposed algorithm.</p>
        <p>[Figure 2: Illustration of the gradient dynamics. Each auxiliary weight $\alpha_{i,j}$ moves with velocity $\approx \eta \cdot (\mathcal{L}[0 \cdot V_{i,j}] - \mathcal{L}[V_{i,j}])$; if the number of positive $\alpha$'s is smaller than $s$, the external force is $\eta\mu &gt; 0$, and if it is larger than $s$, the external force is $-\eta\mu &lt; 0$, separating un-masked components from masked ones.]</p>
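        <p>The Taylor-importance interpretation in (4) can be checked numerically; the toy helper below (all names ours) compares the STE gradient received by one mask entry with the exact loss change from zeroing out the corresponding column, assuming the HardAuxiliaryMask sketch from Section 3.2.</p>
        <preformat><![CDATA[
import torch
import torch.nn.functional as F

def taylor_importance_check(model, masks, x, y, field, col):
    """Return (STE gradient, exact loss difference L[V_ij] - L[0 * V_ij])
    for the mask entry (field, col) on one batch (x, y)."""
    loss = F.binary_cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, masks[field].alpha)[0][col]

    with torch.no_grad():
        kept = F.binary_cross_entropy(model(x), y)     # column un-masked
        old = masks[field].alpha[col].item()
        masks[field].alpha[col] = -1.0                 # hard-mask the column
        dropped = F.binary_cross_entropy(model(x), y)  # column zeroed out
        masks[field].alpha[col] = old                  # restore the mask
    return grad.item(), (kept - dropped).item()
]]></preformat>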
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Experiments</title>
      <p>To validate the performance of our proposed framework, we conduct extensive experiments on three real-world recommendation datasets. Through the experiments, we seek to answer the following three questions: (i) How does our proposed method, HAM, perform compared with other auxiliary mask pruning (AMP) methods? (ii) How does our proposed framework perform compared with the existing field-wise embedding size search methods in the literature, in terms of prediction ability and memory consumption? (iii) How does the soft orthogonality regularity boost the performance of our proposed framework? In the following, we first introduce the experimental setups, including the datasets, baselines, and implementation details, and then present the results as well as discussions.</p>
      <sec id="sec-2-0">
        <title>4.1. Datasets and Data Preparation</title>
        <p>We use three benchmark datasets in our experiments: (i) MovieLens-1M: this dataset contains users' ratings (1-5) on movies. We treat samples with ratings greater than 3 as positive samples and samples with ratings below 3 as negative samples; neutral samples with a rating equal to 3 are dropped. (ii) Criteo: this dataset has 45 million users' clicking records on displayed ads. It contains 26 categorical feature fields and 13 numerical feature fields. (iii) Avazu: this dataset has 40 million clicking records on displayed mobile ads. It has 22 feature fields spanning from user/device features to ad attributes. We follow the common approach to remove infrequent features and discretize numerical values. First, we remove the infrequent feature values and treat them as a single value "unknown", where the threshold is set to {10, 4} for Criteo and Avazu respectively. Second, we convert each numerical value $z$ into a categorical value by transforming it to $\mathrm{int}(\log(z)^2)$ if $\mathrm{int}(z) &gt; 2$ and to $\mathrm{int}(z) - 2$ otherwise, as proposed by the winner of the Criteo Competition (https://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf). Third, we randomly split all training samples into 80% for training, 10% for validation, and 10% for testing.</p>
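        <p>The winner's transformation of numerical values described above corresponds to the following small helper (our naming):</p>
        <preformat><![CDATA[
import math

def discretize(z):
    """Criteo numeric preprocessing: int(log(z)^2) for int(z) > 2,
    and int(z) - 2 otherwise."""
    if int(z) > 2:
        return int(math.log(z) ** 2)
    return int(z) - 2
]]></preformat>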
      <sec id="sec-2-1">
        <title>2https://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf</title>
        <p>
          DNIS [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
auxiliary mask to the embeddings. To be specific, we
select three representative types of the auxiliary mask
listed in Table 1 and apply the same one-shot NAS
framework as described in Algorithm 1:
(i) SAM: the embeddings are masked directly by
auxiliary weights 0 ≤  ≤ 1, which is equivalent to
(ii) SAM-GS: the auxiliary mask is obtained by the
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Gumbel-sigmoid function of auxiliary parameters</title>
        <p>∈ (0, 1) and  ∼</p>
        <p>Uniform(0,1) be;ow
Sigmoid[(log(
) + log(</p>
        <p>))/ ].</p>
        <p>1
−</p>
        <p>
          1
− 
(iii) HAM-p: the mask is generated by Bernoulli random
variables with parameters ’s when forwarding the
network. The gradient for ’s is calculated by STE.
In the retraining stage, we keep the embedding
components with top- auxiliary weights for fair
comparisons.
• AutoDim [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]: the state-of-the-art NAS-based method
in the literature, which outperforms a list of search
methods [
          <xref ref-type="bibr" rid="ref5 ref7 ref8">8, 10, 5, 7</xref>
          ]; see Section 3.4 in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
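        <p>For reference, the SAM-GS mask in (ii) can be sketched as follows, with temperature tau; the clamping of the uniform sample is ours, to keep the logarithms finite.</p>
        <preformat><![CDATA[
import torch

def gumbel_sigmoid_mask(beta, tau=0.1):
    """Soft mask Sigmoid[(log(beta/(1-beta)) + log(u/(1-u))) / tau],
    u ~ Uniform(0, 1), for auxiliary parameters beta in (0, 1)."""
    u = torch.rand_like(beta).clamp(1e-6, 1 - 1e-6)
    logits = torch.log(beta / (1 - beta)) + torch.log(u / (1 - u))
    return torch.sigmoid(logits / tau)
]]></preformat>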
      </sec>
      <sec id="sec-2-impl">
        <title>4.3. Implementation Details</title>
        <p>Base Model Architecture. We adopt four representative CTR-prediction models in the literature, FM [26], DeepFM [27], AutoInt [28], and DCN-V2 [31], as our base models to compare their performance. For FM, DeepFM, and AutoInt, we add an extra embedding vector layer between the feature interaction layer and the embedding layer, as proposed in [10]. Linear transformations (without bias) are applied to the mixed-dimension embeddings so that the transformed embedding vectors are of the same size, which is set to 16 in our experiments. This is not required for DCN-V2 due to its bit-wise crossing mechanism. In the pretraining stage, the base dimension $d_i$ for each feature field $i$ is chosen as $\min(16, n_i)$, where $n_i$ is the field dimension.</p>
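        <p>A sketch of this extra embedding vector layer is given below: each field keeps a (possibly pruned) embedding of its own size, and a bias-free linear map lifts it to the shared dimension required by the vector-wise crossing models; the class and argument names are ours.</p>
        <preformat><![CDATA[
import torch
import torch.nn as nn

class MixedDimEmbedding(nn.Module):
    """Per-field embeddings of different sizes, lifted to a common size."""

    def __init__(self, field_dims, base_dims, out_dim=16):
        super().__init__()
        # Base dimension d_i = min(base_dims[i], n_i), as in Section 4.3.
        dims = [min(d, n) for n, d in zip(field_dims, base_dims)]
        self.tables = nn.ModuleList(
            nn.Embedding(n, d) for n, d in zip(field_dims, dims))
        self.lifts = nn.ModuleList(
            nn.Linear(d, out_dim, bias=False) for d in dims)

    def forward(self, x):  # x: (batch, num_fields) integer feature ids
        vecs = [lift(table(x[:, i]))
                for i, (table, lift) in enumerate(zip(self.tables, self.lifts))]
        return torch.stack(vecs, dim=1)  # (batch, num_fields, out_dim)
]]></preformat>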
        <p>Optimization. We follow the common setup in the literature to employ the Adam optimizer with a learning rate of $10^{-3}$ to optimize the model parameters in all three stages, and use the SGD optimizer to optimize the auxiliary weights in the search stage. To fairly compare the different methods, the number of search epochs is fixed at 10, the learning rates of the auxiliary parameters are chosen from the search grid $\{1, 10^{-1}, 10^{-2}, 10^{-3}\}$, and the temperature of the sigmoid function is chosen from the search grid $\{1, 10^{-1}, 10^{-2}\}$. To obtain the best results for each method, we pick $10^{-2}$ as the SGD learning rate for SAM, SAM-GS, and HAM-p, and $10^{-3}$ for our method HAM. In HAM, the initial value of the auxiliary weights is $\epsilon = 0.01$ and the hyperparameter $\mu = 5 \times 10^{-5}$. For the orthogonal regularity, we employ $\lambda \mathcal{R}(V)$ with $\lambda = 10^{-3}$ for training models on MovieLens-1M, and use the pair-wise cosine similarities $\lambda \mathcal{R}(\bar{V})$ with $\lambda = 10^{-6}$ on Avazu. We use PyTorch to implement our method and train it with a mini-batch size of 2048 on a single 16GB-memory Nvidia Tesla V100.</p>
      </sec>
      <sec id="sec-2-c">
        <title>4.4. Performance Comparison</title>
        <p>We next present detailed comparisons with the other methods. We adopt the AUC (Area Under the ROC Curve) on the test datasets to measure the performance of the models, and the number of embedding parameters to measure the memory usage.</p>
        <sec id="sec-2-4-1">
          <title>Comparison with other AMP Methods. We first</title>
          <p>compare our proposed AMP methods to other common
AMP methods listed in Table 1 on MovieLens-1M. To
fairly compare the performance, we not only consider
comparing the test AUC under the same target total
embedding size  but also take the number of model
parameters into account. In Figure 3, we report the
average values of test AUC under the same total embedding
size and the best results among 10 independent runs in
terms of AUC and plot Test AUC - # Parameter curve on</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>MovieLens-1M. For each method, three measurements</title>
        <p>are provided from left to right with the target total
embedding size  = 14, 28, 42 respectively. We drop the
Test AUC - # Parameter curve of the method with the
worst performance at the bottom. We observe that: (a)
ifxing the target total embedding size, we can tell that
our method HAM outperforms all other methods on four
base models with regard to the value of test AUC. As
illustrated in Test AUC - # Parameter curves, with the same
budget of total embedding size, HAM tends to assign larger
sizes to informative feature fields with larger vocabulary
sizes compared with other methods. This should be
attributed to the gradient dynamics described in Section 3.4
which helps to identify the important embedding
components. As a result, HAM obtains models with higher AUC
under the same total embedding size; (b) it is
interesting to observe that SAM also assigns larger embedding
other methods with the same total embedding size, which
verify the findings in [ 12] that the magnitudes of
auxiliary weights may not indicate the importance; (c) HAM
ture crossing models (FM, DeepFM, AutoInt), but also
on the bit-wise feature crossing model (DCN-V2) where</p>
      </sec>
      <sec id="sec-2-6">
        <title>Uniform serves as a strong baseline. The performance</title>
        <p>of HAM should be credited to the hard selection caused
by indicator functions. With the deterministic binary
mask, the masked model evaluated in the search stage is
equivalent to the selected model in the retraining stage,
as illustrated in Table 1. As a result, we conclude that</p>
      </sec>
      <sec id="sec-2-7">
        <title>HAM exhibits stable performance on all four base models</title>
        <p>AMP meth- exhibits its superiority, not only on the vector-wise
feaOptimization. We follow the common setup in the lit- sizes to relatively un-informative features compared to
DeepFM</p>
        <p>AutoInt</p>
        <p>s=28
Total Embedding Size</p>
        <p>s=42
DeepFM
s=14</p>
        <p>s=28
Total Embedding Size</p>
        <p>s=42
AutoInt
s=14</p>
        <p>s=28
Total Embedding Size</p>
        <p>s=42
DCN-V2</p>
        <p>Uniform
SAM
SAM-GS
HAM-p</p>
        <p>HAM</p>
        <p>s=28
Total Embedding Size</p>
        <p>s=42</p>
        <p>Comparison with AutoDim. We also compare our method with the state-of-the-art NAS-based method AutoDim, with {2, 4, 8} as the candidate embedding sizes, on the Criteo and Avazu datasets in Figure 4. In our method, the total embedding size $s$ is set to 90 and 50 for Criteo and Avazu respectively. Our method outperforms AutoDim by finding models with comparably higher AUC and smaller sizes. In addition, from a computational perspective, our approach HAM has several advantages over AutoDim: (a) HAM has a larger search space due to the structural masking mechanism: for each feature, its candidate embedding size ranges from 0 to its base dimension $d_i$; on the contrary, AutoDim requires manually pre-specifying a set of candidate embedding sizes for each feature. (b) HAM is more computationally efficient. We emphasize that AutoDim introduces extra space and time complexity by the operations of lifting, batch normalization, and aggregation for each candidate size, while HAM only requires an extra element-wise product between the binary mask and the embedding vectors. Moreover, HAM can output multiple models of different total embedding sizes.</p>
        <p>On the Orthogonal Regularity. We further analyze the effect of the orthogonal regularity proposed in our framework. In Figure 5, we visualize the cosine similarities between the column vectors of the embedding table of genre when pretraining DCN-V2 on MovieLens-1M with and without SO, where the embedding size is set to 8. It is clear that SO helps to reduce the correlations within those embedding column vectors. Moreover, we compare the performance of the models selected by our proposed Algorithm 1 with those selected without SO in Table 2. Statistically significant gains in test AUC can be observed for MovieLens-1M and Avazu on all base models, while the training time per epoch with SO only increases by about 1.6 times compared to the time without SO.</p>
        <p>[Figure 5: Cosine similarities between column vectors from the embedding table of genre when pretraining DCN-V2 on MovieLens-1M with and without SO.]</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>The test AUC gain by Algorithm 1 with SO under different target embedding sizes on MovieLens-1M and Avazu. The results are obtained by averaging 10 runs; the reported gains are statistically significant under two-sided t-tests (p &lt; 0.05).</p>
          </caption>
          <table>
            <thead>
              <tr><th>Data</th><th>Size s</th><th>FM</th><th>DeepFM</th><th>AutoInt</th><th>DCN-V2</th></tr>
            </thead>
            <tbody>
              <tr><td>ML-1M</td><td>14</td><td>+.0006</td><td>+.0029</td><td>+.0013</td><td>+.0012</td></tr>
              <tr><td>ML-1M</td><td>28</td><td>+.0018</td><td>+.0034</td><td>+.0018</td><td>+.0027</td></tr>
              <tr><td>ML-1M</td><td>42</td><td>+.0036</td><td>+.0033</td><td>+.0009</td><td>+.0037</td></tr>
              <tr><td>Avazu</td><td>44</td><td>+.0034</td><td>+.0039</td><td>+.0026</td><td>+.0031</td></tr>
              <tr><td>Avazu</td><td>66</td><td>+.0039</td><td>+.0044</td><td>+.0020</td><td>+.0033</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-2-d">
        <title>4.5. Discussions and Future Work</title>
      <sec id="sec-3-1">
        <title>Multi-stage Search. As the initial base dimension</title>
        <p>
          for each feature fields it tunable, we observe a gain,
∼ 0.3%, in test AUC from preliminary experiments on
MovieLens-1M by searching via HAM with a smaller
initial size = 8. This observation confirms the claim made
in recent works [
          <xref ref-type="bibr" rid="ref12">38</xref>
          ] that enlarging search space is
unbeneficial or even detrimental to existing NAS methods.
To address this issue, our method can also be adapted to
design a multi-stage search strategy with a suficiently
larger initial size and stage-wisely prune embeddings
with decreasing target embedding size .
        </p>
        <p>Comparison with PEP. To justify the advantages
of our proposed framework, we also discuss the most
recent Plug-in Embedding Pruning (PEP) [14] method.
As introduced before, PEP is an unstructured pruning
method that retains a large number of zero parameters
and requires the technique of sparse matrix storage. On
the contrary, our method can not only prune the
embedding columns explicitly but also reduce the
parameters in the dense layers while PEP cannot. For example,
HAM ( = 28) can obtain a lighter DCN-V2 model on
MovienLens-1M with only ∼ 8% number of parameters
(3,281/40,241) in the dense layers comparing to the model
with a uniform embedding size 16.</p>
        <p>
          Comparison with SSEDS. During the revisions of
this paper, we noticed that a concurrent work [
          <xref ref-type="bibr" rid="ref13">39</xref>
          ] also
proposes to prune embeddings column-wisely as well
as row-wisely using indicator functions. Our work is
diferent from theirs in two aspects: (i) we focus on
fieldwise search by pruning nearly orthogonal embedding
column vectors; (ii) we use a gradient-descent-based method
in the search stage to solve the bilevel problem on the
training and validation set; the gradient dynamics enable
us to re-activate (un-mask) masked embeddings while
SSEDS directly mask all components with small gradient
magnitudes. Comparing our method with SSEDS and
incorporating our method into SSEDS is our future work.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>The authors are grateful to anonymous reviewers for</title>
        <p>their constructive comments that greatly improved the
presentation of this paper. T. Xiao thanks his Ph.D.
advisor Dr. Krishnakumar Balasubramanian for supports and
helpful discussions during the internship at ByteDance.
in: International Conference on Machine Learning, or propagating gradients through stochastic
neuPMLR, 2020, pp. 1617–1626. rons for conditional computation, arXiv preprint
[9] W.-C. Kang, D. Z. Cheng, T. Chen, X. Yi, D. Lin, arXiv:1308.3432 (2013).</p>
        <p>L. Hong, E. H. Chi, Learning multi-granular quan- [23] E. Jang, S. Gu, B. Poole, Categorical
reparameteritized embeddings for large-vocab categorical fea- zation with gumbel-softmax, in: International
Contures in recommender systems, in: Companion ference on Learning Representations, 2016.
Proceedings of the Web Conference 2020, 2020, pp. [24] S. Srinivas, A. Subramanya, R. Venkatesh Babu,
562–566. Training sparse neural networks, in: Proceedings
[10] A. A. Ginart, M. Naumov, D. Mudigere, J. Yang, of the IEEE conference on computer vision and
patJ. Zou, Mixed dimension embeddings with applica- tern recognition workshops, 2017, pp. 138–145.
tion to memory-eficient recommendation systems, [25] M. Ye, D. Choudhary, J. Yu, E. Wen, Z. Chen, J. Yang,
in: 2021 IEEE International Symposium on Infor- J. Park, Q. Liu, A. Kejariwal, Adaptive
dense-tomation Theory (ISIT), IEEE, 2021, pp. 2786–2791. sparse paradigm for pruning online
recommen[11] H. Liu, K. Simonyan, Y. Yang, Darts: Diferentiable dation system with non-stationary data, arXiv
architecture search, in: International Conference preprint arXiv:2010.08655 (2020).</p>
        <p>on Learning Representations, 2018. [26] S. Rendle, Factorization machines, in: 2010 IEEE
In[12] R. Wang, M. Cheng, X. Chen, X. Tang, C.-J. Hsieh, ternational conference on data mining, IEEE, 2010,
Rethinking architecture selection in diferentiable pp. 995–1000.</p>
        <p>nas, arXiv preprint arXiv:2108.04392 (2021). [27] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, Deepfm: a
[13] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, factorization-machine based neural network for ctr
Q. Le, Understanding and simplifying one-shot prediction, in: Proceedings of the 26th International
architecture search, in: International Conference Joint Conference on Artificial Intelligence, 2017, pp.
on Machine Learning, PMLR, 2018, pp. 550–559. 1725–1731.
[14] S. Liu, C. Gao, Y. Chen, D. Jin, Y. Li, Learnable em- [28] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang,
bedding sizes for recommender systems, in: Inter- J. Tang, Autoint: Automatic feature interaction
national Conference on Learning Representations, learning via self-attentive neural networks, in:
Pro2020. ceedings of the 28th ACM International Conference
[15] D. Blalock, J. J. G. Ortiz, J. Frankle, J. Guttag, What is on Information and Knowledge Management, 2019,
the state of neural network pruning?, arXiv preprint pp. 1161–1170.</p>
        <p>arXiv:2003.03033 (2020). [29] R. Wang, B. Fu, G. Fu, M. Wang, Deep &amp; cross
[16] P. Savarese, H. Silva, M. Maire, Winning the lottery network for ad click predictions, in: Proceedings
with continuous sparsification, in: Advances in of the ADKDD’17, 2017, pp. 1–7.</p>
        <p>Neural Information Processing Systems, volume 33, [30] X. He, T.-S. Chua, Neural factorization machines for
Curran Associates, Inc., 2020, pp. 11380–11390. sparse predictive analytics, in: Proceedings of the
[17] C. Louizos, M. Welling, D. P. Kingma, Learning 40th International ACM SIGIR conference on
Resparse neural networks through l_0 regularization, search and Development in Information Retrieval,
in: International Conference on Learning Represen- 2017, pp. 355–364.</p>
        <p>tations, 2018. [31] R. Wang, R. Shivanna, D. Cheng, S. Jain, D. Lin,
[18] S. Xie, H. Zheng, C. Liu, L. Lin, Snas: stochastic L. Hong, E. Chi, Dcn v2: Improved deep &amp; cross
neural architecture search, in: International Con- network and practical lessons for web-scale
learnference on Learning Representations, 2018. ing to rank systems, in: Proceedings of the Web
[19] Y. Guo, A. Yao, Y. Chen, Dynamic network surgery Conference 2021, 2021, pp. 1785–1797.
for eficient dnns, in: Proceedings of the 30th In- [32] N. Bansal, X. Chen, Z. Wang, Can we gain more
ternational Conference on Neural Information Pro- from orthogonality regularizations in training deep
cessing Systems, 2016, pp. 1387–1395. networks?, Advances in Neural Information
Pro[20] X. Xiao, Z. Wang, Autoprune: Automatic network cessing Systems 31 (2018) 4261–4271.
pruning by regularizing auxiliary parameters, Ad- [33] P. Rodríguez, J. Gonzalez, G. Cucurull, J. M. Gonfaus,
vances in Neural Information Processing Systems X. Roca, Regularizing cnns with locally constrained
32 (NeurIPS 2019) 32 (2019). decorrelations, arXiv preprint arXiv:1611.01967
[21] X. Zhou, W. Zhang, H. Xu, T. Zhang, Efective spar- (2016).</p>
        <p>sification of neural networks with global sparsity [34] S. Yan, Y. Zheng, W. Ao, X. Zeng, M. Zhang, Does
constraint, in: Proceedings of the IEEE/CVF Confer- unsupervised architecture representation learning
ence on Computer Vision and Pattern Recognition, help neural architecture search?, Advances in
Neu2021, pp. 3599–3608. ral Information Processing Systems 33 (2020) 12486–
[22] Y. Bengio, N. Léonard, A. Courville, Estimating 12498.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Differentiable neural input search for recommender systems</article-title>
          , arXiv preprint arXiv:
          <year>2006</year>
          .
          <volume>04466</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <article-title>Memory-efficient embedding for recommendations</article-title>
          , arXiv preprint arXiv:
          <year>2006</year>
          .
          <volume>14827</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <article-title>Deep learning based recommender system: A survey and new perspectives, ACM Computing Surveys (CSUR) 52 (</article-title>
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Covington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Adams</surname>
          </string-name>
          , E. Sargin,
          <article-title>Deep neural networks for youtube recommendations</article-title>
          ,
          <source>in: Proceedings of the 10th ACM conference on recommender systems</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>191</fpage>
          -
          <lpage>198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , Autoemb:
          <article-title>Automated embedding dimensionality search in streaming recommendations</article-title>
          , arXiv preprint arXiv:
          <year>2002</year>
          .
          <volume>11252</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Automated embedding size search in deep recommender systems</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2307</fpage>
          -
          <lpage>2316</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Joglekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. K.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Khaitan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Neural input search for large scale recommendation models</article-title>
          ,
          <source>in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2387</fpage>
          -
          <lpage>2397</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Differentiable product quantization for end-to-end embedding compression</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          , PMLR,
          <year>2020</year>
          , pp.
          <fpage>1617</fpage>
          -
          <lpage>1626</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="bib9">
        <mixed-citation>[9] W.-C. Kang, D. Z. Cheng, T. Chen, X. Yi, D. Lin, L. Hong, E. H. Chi, Learning multi-granular quantized embeddings for large-vocab categorical features in recommender systems, in: Companion Proceedings of the Web Conference 2020, 2020, pp. 562-566.</mixed-citation>
      </ref>
      <ref id="bib10">
        <mixed-citation>[10] A. A. Ginart, M. Naumov, D. Mudigere, J. Yang, J. Zou, Mixed dimension embeddings with application to memory-efficient recommendation systems, in: 2021 IEEE International Symposium on Information Theory (ISIT), IEEE, 2021, pp. 2786-2791.</mixed-citation>
      </ref>
      <ref id="bib11">
        <mixed-citation>[11] H. Liu, K. Simonyan, Y. Yang, Darts: Differentiable architecture search, in: International Conference on Learning Representations, 2018.</mixed-citation>
      </ref>
      <ref id="bib12">
        <mixed-citation>[12] R. Wang, M. Cheng, X. Chen, X. Tang, C.-J. Hsieh, Rethinking architecture selection in differentiable nas, arXiv preprint arXiv:2108.04392 (2021).</mixed-citation>
      </ref>
      <ref id="bib13">
        <mixed-citation>[13] G. Bender, P.-J. Kindermans, B. Zoph, V. Vasudevan, Q. Le, Understanding and simplifying one-shot architecture search, in: International Conference on Machine Learning, PMLR, 2018, pp. 550-559.</mixed-citation>
      </ref>
      <ref id="bib14">
        <mixed-citation>[14] S. Liu, C. Gao, Y. Chen, D. Jin, Y. Li, Learnable embedding sizes for recommender systems, in: International Conference on Learning Representations, 2020.</mixed-citation>
      </ref>
      <ref id="bib15">
        <mixed-citation>[15] D. Blalock, J. J. G. Ortiz, J. Frankle, J. Guttag, What is the state of neural network pruning?, arXiv preprint arXiv:2003.03033 (2020).</mixed-citation>
      </ref>
      <ref id="bib16">
        <mixed-citation>[16] P. Savarese, H. Silva, M. Maire, Winning the lottery with continuous sparsification, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 11380-11390.</mixed-citation>
      </ref>
      <ref id="bib17">
        <mixed-citation>[17] C. Louizos, M. Welling, D. P. Kingma, Learning sparse neural networks through l_0 regularization, in: International Conference on Learning Representations, 2018.</mixed-citation>
      </ref>
      <ref id="bib18">
        <mixed-citation>[18] S. Xie, H. Zheng, C. Liu, L. Lin, Snas: stochastic neural architecture search, in: International Conference on Learning Representations, 2018.</mixed-citation>
      </ref>
      <ref id="bib19">
        <mixed-citation>[19] Y. Guo, A. Yao, Y. Chen, Dynamic network surgery for efficient dnns, in: Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 1387-1395.</mixed-citation>
      </ref>
      <ref id="bib20">
        <mixed-citation>[20] X. Xiao, Z. Wang, Autoprune: Automatic network pruning by regularizing auxiliary parameters, Advances in Neural Information Processing Systems 32 (2019).</mixed-citation>
      </ref>
      <ref id="bib21">
        <mixed-citation>[21] X. Zhou, W. Zhang, H. Xu, T. Zhang, Effective sparsification of neural networks with global sparsity constraint, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3599-3608.</mixed-citation>
      </ref>
      <ref id="bib22">
        <mixed-citation>[22] Y. Bengio, N. Léonard, A. Courville, Estimating or propagating gradients through stochastic neurons for conditional computation, arXiv preprint arXiv:1308.3432 (2013).</mixed-citation>
      </ref>
      <ref id="bib23">
        <mixed-citation>[23] E. Jang, S. Gu, B. Poole, Categorical reparameterization with gumbel-softmax, in: International Conference on Learning Representations, 2016.</mixed-citation>
      </ref>
      <ref id="bib24">
        <mixed-citation>[24] S. Srinivas, A. Subramanya, R. Venkatesh Babu, Training sparse neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 138-145.</mixed-citation>
      </ref>
      <ref id="bib25">
        <mixed-citation>[25] M. Ye, D. Choudhary, J. Yu, E. Wen, Z. Chen, J. Yang, J. Park, Q. Liu, A. Kejariwal, Adaptive dense-to-sparse paradigm for pruning online recommendation system with non-stationary data, arXiv preprint arXiv:2010.08655 (2020).</mixed-citation>
      </ref>
      <ref id="bib26">
        <mixed-citation>[26] S. Rendle, Factorization machines, in: 2010 IEEE International Conference on Data Mining, IEEE, 2010, pp. 995-1000.</mixed-citation>
      </ref>
      <ref id="bib27">
        <mixed-citation>[27] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, Deepfm: a factorization-machine based neural network for ctr prediction, in: Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017, pp. 1725-1731.</mixed-citation>
      </ref>
      <ref id="bib28">
        <mixed-citation>[28] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, J. Tang, Autoint: Automatic feature interaction learning via self-attentive neural networks, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1161-1170.</mixed-citation>
      </ref>
      <ref id="bib29">
        <mixed-citation>[29] R. Wang, B. Fu, G. Fu, M. Wang, Deep &amp; cross network for ad click predictions, in: Proceedings of the ADKDD'17, 2017, pp. 1-7.</mixed-citation>
      </ref>
      <ref id="bib30">
        <mixed-citation>[30] X. He, T.-S. Chua, Neural factorization machines for sparse predictive analytics, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 355-364.</mixed-citation>
      </ref>
      <ref id="bib31">
        <mixed-citation>[31] R. Wang, R. Shivanna, D. Cheng, S. Jain, D. Lin, L. Hong, E. Chi, Dcn v2: Improved deep &amp; cross network and practical lessons for web-scale learning to rank systems, in: Proceedings of the Web Conference 2021, 2021, pp. 1785-1797.</mixed-citation>
      </ref>
      <ref id="bib32">
        <mixed-citation>[32] N. Bansal, X. Chen, Z. Wang, Can we gain more from orthogonality regularizations in training deep networks?, Advances in Neural Information Processing Systems 31 (2018) 4261-4271.</mixed-citation>
      </ref>
      <ref id="bib33">
        <mixed-citation>[33] P. Rodríguez, J. Gonzalez, G. Cucurull, J. M. Gonfaus, X. Roca, Regularizing cnns with locally constrained decorrelations, arXiv preprint arXiv:1611.01967 (2016).</mixed-citation>
      </ref>
      <ref id="bib34">
        <mixed-citation>[34] S. Yan, Y. Zheng, W. Ao, X. Zeng, M. Zhang, Does unsupervised architecture representation learning help neural architecture search?, Advances in Neural Information Processing Systems 33 (2020) 12486-12498.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>H.</given-names>
            <surname>Hazimeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sathiamoorthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mazumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <article-title>Dselect-k: Differentiable selection in the mixture of experts with applications to multi-task learning</article-title>
          ,
          <source>arXiv preprint arXiv:2106.03760</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>P.</given-names>
            <surname>Molchanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tyree</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Karras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Aila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kautz</surname>
          </string-name>
          ,
          <article-title>Pruning convolutional neural networks for resource efficient inference</article-title>
          ,
          <source>arXiv preprint arXiv:1611.06440</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>P.</given-names>
            <surname>Molchanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mallya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tyree</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Frosio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kautz</surname>
          </string-name>
          ,
          <article-title>Importance estimation for neural network pruning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>11264</fpage>
          -
          <lpage>11272</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Ouyang,
          <article-title>Evolving search space for neural architecture search</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>6659</fpage>
          -
          <lpage>6669</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>L.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Single-shot embedding dimension search in recommender system</article-title>
          ,
          <source>arXiv preprint arXiv:2204.03281</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>